Generic Local Document Parser

A privacy-focused, local Python framework that ingests structured chat exports (WhatsApp .txt, .pdf, .docx) and extracts tagged content into a clean, readable Markdown file.

No external APIs. No data leaves your machine. Pure rule-based, local processing.

Why This Project

Group chats used for structured learning (e.g. daily coding challenges, study groups, revision sessions) accumulate hundreds of messages over time. Manually sifting through chat exports to find specific content is tedious and error-prone.

This tool solves that by:

- Parsing raw chat exports automatically
- Filtering messages by date range
- Scoring messages using configurable heuristics to identify relevant content
- Exporting only the matched content into a structured Markdown document

It is designed to be generic and extensible: the scoring signals, system message keywords, and date ranges are all configurable via config.py, so it can be adapted to any structured group chat workflow.

Project Structure

document_parser/
├── main.py                    # CLI entry point
├── config.py                  # All constants: regexes, thresholds, keyword lists
├── requirements.txt           # Optional dependencies
├── models/
│   └── record.py              # Record dataclass shared across all layers
├── ingestion/
│   ├── base_adapter.py        # Abstract base class for adapters
│   ├── txt_adapter.py         # Primary adapter — handles .txt exports
│   ├── pdf_adapter.py         # Delegates to TxtAdapter via pdfplumber
│   └── docx_adapter.py        # Delegates to TxtAdapter via python-docx
├── temporal/
│   └── validator.py           # Date parsing and range filtering
├── classifier/
│   └── heuristic.py           # Weighted scoring classifier
└── exporter/
    └── markdown_exporter.py   # Groups results by day and writes Markdown

Requirements

Python 3.8+
No dependencies required for .txt parsing (stdlib only)
Optional dependencies for other formats:

pip install -r requirements.txt

| Package | Required for |
|---|---|
| `pdfplumber` | `.pdf` input files |
| `python-docx` | `.docx` input files |

Usage

python3 main.py --input <path-to-chat-export> --output <output.md>

All Options

| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Path to the input file (`.txt`, `.pdf`, `.docx`) |
| `--output` | `problems.md` | Path for the output Markdown file |
| `--from-date` | `2025-01-01` | Filter messages from this date (YYYY-MM-DD) |
| `--to-date` | `2026-12-31` | Filter messages up to this date (YYYY-MM-DD) |
| `--threshold` | `5` | Minimum score for a message to be included |
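
For readers who want to see how these flags could be wired up, here is a minimal sketch using the standard-library `argparse` module. The function name `build_parser` is hypothetical and the actual `main.py` may be organised differently; only the flag names and defaults are taken from the table above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Generic Local Document Parser")
    parser.add_argument("--input", required=True,
                        help="Path to the input file (.txt, .pdf, .docx)")
    parser.add_argument("--output", default="problems.md",
                        help="Path for the output Markdown file")
    parser.add_argument("--from-date", default="2025-01-01",
                        help="Filter messages from this date (YYYY-MM-DD)")
    parser.add_argument("--to-date", default="2026-12-31",
                        help="Filter messages up to this date (YYYY-MM-DD)")
    parser.add_argument("--threshold", type=int, default=5,
                        help="Minimum score for a message to be included")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--input", "chat.txt", "--threshold", "3"])
print(args.input, args.output, args.from_date, args.to_date, args.threshold)
```

Note that `argparse` exposes `--from-date` and `--to-date` as `args.from_date` and `args.to_date`.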

Examples

Basic usage:

python3 main.py --input Whatsapp_Document.txt --output problems.md

With date range:

python3 main.py --input Whatsapp_Document.txt --output problems.md \
                --from-date 2025-06-01 --to-date 2025-12-31

Lower threshold (capture more messages):

python3 main.py --input Whatsapp_Document.txt --output problems.md --threshold 3

How It Works

The pipeline has four stages:

Adapter → TemporalValidator → FeatureClassifier → MarkdownExporter

1. Ingestion: The adapter reads the file line by line, buffers multi-line messages, and separates system messages from user messages. It handles the WhatsApp timestamp format, including Unicode whitespace.
2. Temporal Validation: Parses DD/MM/YY dates, expands 2-digit years (25 → 2025), and filters out messages outside the configured date range.
3. Classification: Each message is scored using weighted regex signals. Messages that meet the threshold are tagged with a day number and type. System messages are discarded before scoring.
4. Export: Matched messages are grouped by day number, sorted, and written to Markdown, with WhatsApp *bold* converted to Markdown **bold**.
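
The first two stages can be sketched roughly as follows. This is an illustration, not the project's adapter: the header pattern assumes an Android-style export line such as `30/06/25, 9:15 pm - Sender: text`, and the names `HEADER_RE`, `parse_date`, and `read_messages` are hypothetical. Other export layouts would need a different pattern.

```python
import re
from datetime import date
from typing import Iterable, List, Optional, Tuple

# Assumed header shape: "30/06/25, 9:15 pm - Sender: text".
# In a str pattern, \s also matches Unicode spaces such as U+202F,
# which some exports place before "am"/"pm".
HEADER_RE = re.compile(
    r"^(\d{1,2})/(\d{1,2})/(\d{2,4}),\s+\d{1,2}:\d{2}(?:\s*[AaPp][Mm])?\s+-\s+(.*)$"
)


def parse_date(day: str, month: str, year: str) -> Optional[date]:
    """Parse DD/MM/YY parts, expanding 2-digit years (25 -> 2025)."""
    y = int(year)
    if y < 100:
        y += 2000
    try:
        return date(y, int(month), int(day))
    except ValueError:
        return None  # e.g. 31/02/25 is not a real date


def read_messages(lines: Iterable[str]) -> List[Tuple[date, str]]:
    """Buffer lines into (date, body) pairs; headerless lines continue the previous message."""
    messages: List[Tuple[date, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER_RE.match(line)
        parsed = parse_date(*match.group(1, 2, 3)) if match else None
        if match and parsed is not None:
            messages.append((parsed, match.group(4)))
        elif messages:
            prev_date, body = messages[-1]
            messages[-1] = (prev_date, body + "\n" + line)
    return messages


sample = [
    "30/06/25, 9:15\u202fpm - xyz Sir: *Day 36*",
    "Given a number, convert it into words.",
    "01/01/24, 10:00 am - Someone: old message",
]
msgs = read_messages(sample)

# Range filter using the documented default dates.
start, end = date(2025, 1, 1), date(2026, 12, 31)
kept = [m for m in msgs if start <= m[0] <= end]
print(len(msgs), len(kept))
```

In this sketch the sender stays inside the message body; a fuller adapter would split it out and flag system messages before classification.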

Scoring Signals (configurable in `config.py`)

| Pattern | Weight |
|---|---|
| `Day \d+` | 5 |
| `Test Case` | 3 |
| `Revision for the day` | 3 |
| `Input:` / `Output:` | 2 each |
| DS/Algo keywords (Array, Stack, Graph, etc.) | 1 each |

Default threshold: 5 (a message with just Day 36 already qualifies).
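
To make the arithmetic concrete, here is a small sketch of weighted scoring with the signals listed above. It assumes each pattern contributes its weight at most once per message and that matching is case-sensitive; the real classifier in `classifier/heuristic.py` may differ on both points. The keyword list is limited to the three examples named in the table, and the `score` function is hypothetical.

```python
import re
from typing import List, Tuple

# Only the keywords named in the table; the real list is longer ("etc.").
DS_ALGO_KEYWORDS = ["Array", "Stack", "Graph"]

# (regex, weight) pairs; the data shape is an assumption.
POSITIVE_SIGNALS: List[Tuple[str, int]] = [
    (r"Day \d+", 5),
    (r"Test Case", 3),
    (r"Revision for the day", 3),
    (r"Input:", 2),
    (r"Output:", 2),
] + [(rf"\b{re.escape(kw)}\b", 1) for kw in DS_ALGO_KEYWORDS]

SCORE_THRESHOLD = 5


def score(text: str) -> int:
    """Sum the weights of every signal that matches at least once."""
    return sum(weight for pattern, weight in POSITIVE_SIGNALS
               if re.search(pattern, text))


for message in ["Day 36", "Input: 7824\nOutput: seven", "Use a Stack and an Array"]:
    total = score(message)
    print(repr(message), total, total >= SCORE_THRESHOLD)
```

Under these assumptions a message containing only `Day 36` scores 5 and is kept, while a bare `Input:`/`Output:` pair scores 4 and falls just below the default threshold.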

Output Format

# Day 36 — Problem Solving For The Day
**Date:** 2025-06-30
**Sender:** xyz Sir

## Problem
Given a number, convert it into the form of words...

## Test Cases
**Test Case 1:**
- Input: `7824`
- Output: `seven thousand eight hundred twenty four`
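
The `**bold**` markers in the sample above come from converting WhatsApp's single-asterisk bold. One possible way to do that conversion with a regular expression is sketched below; the pattern and the function name `whatsapp_bold_to_markdown` are assumptions for illustration, not necessarily what `markdown_exporter.py` does.

```python
import re

# A bold span starts and ends with a non-space, non-asterisk character,
# stays on one line, and is not glued to a word character or another asterisk.
BOLD_RE = re.compile(r"(?<![\w*])\*([^\s*](?:[^*\n]*[^\s*])?)\*(?![\w*])")


def whatsapp_bold_to_markdown(text: str) -> str:
    """Rewrite WhatsApp *bold* spans as Markdown **bold**."""
    return BOLD_RE.sub(r"**\1**", text)


print(whatsapp_bold_to_markdown("*Test Case 1:*"))
print(whatsapp_bold_to_markdown("2 * 3 * 4"))
```

The lookarounds are there so that arithmetic such as `2 * 3 * 4` and text that is already double-starred are left untouched.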

Customization

All tunable values live in config.py:

- SYSTEM_KEYWORDS: strings that identify non-user system messages (e.g. "joined using", "created group")
- POSITIVE_SIGNALS: regex/weight pairs used for scoring
- SCORE_THRESHOLD: minimum score to include a message
- DEFAULT_FROM_DATE / DEFAULT_TO_DATE: default date range

To adapt this for a different chat format or content type, update these values — no code changes needed.
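
As an illustration of how a keyword list like `SYSTEM_KEYWORDS` could drive filtering, the sketch below drops any message containing one of the keywords. The list contents come from the examples in this README; the case-insensitive substring match and the helper name `is_system_message` are assumptions.

```python
from typing import List

# Example values taken from this README; the real config.py may list more.
SYSTEM_KEYWORDS: List[str] = [
    "joined using",
    "created group",
    "end-to-end",
    "disappearing",
]


def is_system_message(text: str) -> bool:
    """Return True if the text contains any system keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SYSTEM_KEYWORDS)


messages = [
    "Messages and calls are end-to-end encrypted.",
    "Someone joined using this group's invite link",
    "Day 36: convert a number into words",
]
user_messages = [m for m in messages if not is_system_message(m)]
print(user_messages)
```

With this shape, adapting the filter to another chat platform only means editing the list.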

Verification

After running, you can sanity-check the output:

# Count how many Day entries were extracted
grep -c "^# Day" problems.md

# Spot-check a specific day
grep -A 20 "^# Day 36" problems.md

# Confirm no system noise leaked through (this should print nothing)
grep "joined using\|end-to-end\|disappearing" problems.md

Privacy

This tool runs entirely offline. Your chat data is never sent anywhere — all parsing and classification happens locally using Python's standard library and optional local packages.
