A privacy-focused, local Python framework that ingests structured chat exports (WhatsApp .txt, .pdf, .docx) and extracts tagged content into a clean, readable Markdown file.
No external APIs. No data leaves your machine. Pure rule-based, local processing.
Group chats used for structured learning (e.g. daily coding challenges, study groups, revision sessions) accumulate hundreds of messages over time. Manually sifting through chat exports to find specific content is tedious and error-prone.
This tool solves that by:
- Parsing raw chat exports automatically
- Filtering messages by date range
- Scoring messages using configurable heuristics to identify relevant content
- Exporting only the matched content into a structured Markdown document
It is designed to be generic and extensible — the scoring signals, system message keywords, and date ranges are all configurable via config.py, making it adaptable to any structured group chat workflow.

    document_parser/
    ├── main.py                    # CLI entry point
    ├── config.py                  # All constants: regexes, thresholds, keyword lists
    ├── requirements.txt           # Optional dependencies
    ├── models/
    │   └── record.py              # Record dataclass shared across all layers
    ├── ingestion/
    │   ├── base_adapter.py        # Abstract base class for adapters
    │   ├── txt_adapter.py         # Primary adapter; handles .txt exports
    │   ├── pdf_adapter.py         # Delegates to TxtAdapter via pdfplumber
    │   └── docx_adapter.py        # Delegates to TxtAdapter via python-docx
    ├── temporal/
    │   └── validator.py           # Date parsing and range filtering
    ├── classifier/
    │   └── heuristic.py           # Weighted scoring classifier
    └── exporter/
        └── markdown_exporter.py   # Groups results by day and writes Markdown

- Python 3.8+
- No dependencies required for `.txt` parsing (stdlib only)
- Optional dependencies for other formats:

    pip install -r requirements.txt

| Package | Required for |
|---|---|
| `pdfplumber` | `.pdf` input files |
| `python-docx` | `.docx` input files |


    python3 main.py --input <path-to-chat-export> --output <output.md>

| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Path to the input file (.txt, .pdf, .docx) |
| `--output` | `problems.md` | Path for the output Markdown file |
| `--from-date` | `2025-01-01` | Filter messages from this date (YYYY-MM-DD) |
| `--to-date` | `2026-12-31` | Filter messages up to this date (YYYY-MM-DD) |
| `--threshold` | `5` | Minimum score for a message to be included |

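As a rough illustration of how the flags in the table could be wired up, here is a minimal `argparse` sketch. The flag names and defaults are taken from the table; the `build_parser` function and the help strings are assumptions for illustration and may not match what `main.py` actually does.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract tagged content from a chat export into Markdown."
    )
    parser.add_argument("--input", required=True,
                        help="Path to the input file (.txt, .pdf, .docx)")
    parser.add_argument("--output", default="problems.md",
                        help="Path for the output Markdown file")
    # argparse exposes these two as args.from_date and args.to_date.
    parser.add_argument("--from-date", default="2025-01-01",
                        help="Filter messages from this date (YYYY-MM-DD)")
    parser.add_argument("--to-date", default="2026-12-31",
                        help="Filter messages up to this date (YYYY-MM-DD)")
    parser.add_argument("--threshold", type=int, default=5,
                        help="Minimum score for a message to be included")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input, args.output, args.from_date, args.to_date, args.threshold)
```

Declaring `--threshold` with `type=int` means a non-numeric value is rejected at the command line rather than failing later during scoring.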
Basic usage:

    python3 main.py --input Whatsapp_Document.txt --output problems.md

With date range:

    python3 main.py --input Whatsapp_Document.txt --output problems.md \
      --from-date 2025-06-01 --to-date 2025-12-31

Lower threshold (capture more messages):

    python3 main.py --input Whatsapp_Document.txt --output problems.md --threshold 3

The pipeline has four stages:

    Adapter → TemporalValidator → FeatureClassifier → MarkdownExporter

- Ingestion: The adapter reads the file line by line, buffers multi-line messages, and separates system messages from user messages. It handles the WhatsApp timestamp format, including Unicode whitespace.
- Temporal Validation: Parses `DD/MM/YY` dates, handles 2-digit year expansion (25 → 2025), and filters messages outside the configured date range.
- Classification: Each message is scored using weighted regex signals. Messages that meet the threshold are tagged with a day number and type. System messages are discarded before scoring.
- Export: Matched messages are grouped by day number, sorted, and written to Markdown with WhatsApp `*bold*` converted to Markdown `**bold**`.

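The ingestion and temporal validation stages can be sketched roughly as follows. This is not the project's adapter code: the line shape (`DD/MM/YY, H:MM am/pm - Sender: text`), the regex, the dict-based records, and the function names are all assumptions chosen for illustration, and real exports vary by platform and locale.

```python
import re
from datetime import date

# Assumed line shape: "30/06/25, 9:15 pm - Sender: text".
# For str patterns, \s also matches Unicode whitespace such as U+202F.
LINE_RE = re.compile(
    r"^(\d{1,2})/(\d{1,2})/(\d{2,4}),\s+\d{1,2}:\d{2}(?:\s*[APap][Mm])?\s+-\s+(.*)$"
)


def parse_date(day, month, year):
    """Build a date from DD/MM/YY parts, expanding 2-digit years (25 -> 2025)."""
    y = int(year)
    if y < 100:
        y += 2000
    return date(y, int(month), int(day))


def parse_lines(lines):
    """Turn raw export lines into records, buffering continuation lines."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            day, month, year, rest = match.groups()
            sender, sep, text = rest.partition(": ")
            if not sep:
                # No "Sender: " prefix: treated here as a system message.
                sender, text = None, rest
            records.append(
                {"date": parse_date(day, month, year), "sender": sender, "text": text}
            )
        elif records:
            # No timestamp: this line continues the previous message.
            records[-1]["text"] += "\n" + line
    return records


def in_range(records, start, end):
    """Keep only records whose date falls within [start, end]."""
    return [r for r in records if start <= r["date"] <= end]
```

Treating any line without a `Sender: ` prefix as a system message is a simplification; the tool itself is described as using configurable system message keywords.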
| Pattern | Weight |
|---|---|
| `Day \d+` | 5 |
| `Test Case` | 3 |
| `Revision for the day` | 3 |
| `Input:` / `Output:` | 2 each |
| DS/Algo keywords (Array, Stack, Graph, etc.) | 1 each |

Default threshold: 5 (a message with just Day 36 already qualifies).
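A minimal sketch of weighted scoring with these signals might look like the following. The weights come from the table; the exact regexes, the short keyword list, and the choice to count each signal at most once per message are assumptions, and the real classifier may differ.

```python
import re

# (pattern, weight) pairs. Weights follow the table; regexes are illustrative.
POSITIVE_SIGNALS = [
    (re.compile(r"Day \d+"), 5),
    (re.compile(r"Test Case"), 3),
    (re.compile(r"Revision for the day"), 3),
    (re.compile(r"Input:"), 2),
    (re.compile(r"Output:"), 2),
    # A small sample of DS/Algo keywords, 1 point each.
    (re.compile(r"\bArray\b"), 1),
    (re.compile(r"\bStack\b"), 1),
    (re.compile(r"\bGraph\b"), 1),
]
SCORE_THRESHOLD = 5


def score(text):
    """Sum the weights of all signals found in the text (once per signal)."""
    return sum(weight for pattern, weight in POSITIVE_SIGNALS if pattern.search(text))


def is_relevant(text, threshold=SCORE_THRESHOLD):
    """Return True when the message score meets the threshold."""
    return score(text) >= threshold
```

Under these assumptions, `score("Day 36")` is 5, which meets the default threshold, while a message containing only `Input:` and `Output:` scores 4 and is included only when the threshold is lowered.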

    # Day 36 — Problem Solving For The Day
    **Date:** 2025-06-30
    **Sender:** xyz Sir
    ## Problem
    Given a number, convert it into the form of words...
    ## Test Cases
    **Test Case 1:**
    - Input: `7824`
    - Output: `seven thousand eight hundred twenty four`

All tunable values live in config.py:

- `SYSTEM_KEYWORDS`: strings that identify non-user system messages (e.g. "joined using", "created group")
- `POSITIVE_SIGNALS`: regex/weight pairs used for scoring
- `SCORE_THRESHOLD`: minimum score to include a message
- `DEFAULT_FROM_DATE` / `DEFAULT_TO_DATE`: default date range

To adapt this for a different chat format or content type, update these values — no code changes needed.
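To make the idea concrete, here is a hypothetical `config.py` adapted for a quiz-style study group. The constant names are the ones listed above; the value shapes (lists of strings, regex/weight tuples, ISO date strings), the `Quiz`/`Answer:` patterns, and the `is_system_message` helper are assumptions for illustration only.

```python
# Hypothetical config.py layout; the real file may structure these differently.

# Substrings that mark non-user system messages (compared in lowercase).
SYSTEM_KEYWORDS = ["joined using", "created group", "end-to-end", "disappearing"]

# (regex, weight) pairs, here retargeted at a made-up quiz workflow.
POSITIVE_SIGNALS = [
    (r"Quiz \d+", 5),
    (r"Answer:", 2),
]

# Minimum score for a message to be included.
SCORE_THRESHOLD = 5

# Default date range, in the YYYY-MM-DD form used by the CLI flags.
DEFAULT_FROM_DATE = "2025-01-01"
DEFAULT_TO_DATE = "2026-12-31"


def is_system_message(text):
    """Illustrative helper: True if any system keyword appears in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SYSTEM_KEYWORDS)
```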
After running, you can sanity-check the output:

    # Count how many Day entries were extracted
    grep -c "^# Day" problems.md

    # Spot-check a specific day
    grep -A 20 "^# Day 36" problems.md

    # Confirm no system noise leaked through
    grep "joined using\|end-to-end\|disappearing" problems.md

This tool runs entirely offline. Your chat data is never sent anywhere; all parsing and classification happens locally using Python's standard library and optional local packages.