ThunderShadows/Generic-Local-Document-Parser

Generic Local Document Parser

A privacy-focused, local Python framework that ingests structured chat exports (WhatsApp .txt, .pdf, .docx), extracts tagged content, and writes it to a clean, readable Markdown file.

No external APIs. No data leaves your machine. Pure rule-based, local processing.


Why This Project

Group chats used for structured learning (e.g. daily coding challenges, study groups, revision sessions) accumulate hundreds of messages over time. Manually sifting through chat exports to find specific content is tedious and error-prone.

This tool solves that by:

  • Parsing raw chat exports automatically
  • Filtering messages by date range
  • Scoring messages using configurable heuristics to identify relevant content
  • Exporting only the matched content into a structured Markdown document

It is designed to be generic and extensible — the scoring signals, system message keywords, and date ranges are all configurable via config.py, making it adaptable to any structured group chat workflow.


Project Structure

document_parser/
├── main.py                    # CLI entry point
├── config.py                  # All constants: regexes, thresholds, keyword lists
├── requirements.txt           # Optional dependencies
├── models/
│   └── record.py              # Record dataclass shared across all layers
├── ingestion/
│   ├── base_adapter.py        # Abstract base class for adapters
│   ├── txt_adapter.py         # Primary adapter — handles .txt exports
│   ├── pdf_adapter.py         # Delegates to TxtAdapter via pdfplumber
│   └── docx_adapter.py        # Delegates to TxtAdapter via python-docx
├── temporal/
│   └── validator.py           # Date parsing and range filtering
├── classifier/
│   └── heuristic.py           # Weighted scoring classifier
└── exporter/
    └── markdown_exporter.py   # Groups results by day and writes Markdown
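
The layout above suggests an adapter pattern in which the PDF and DOCX adapters extract plain text and hand it to the text adapter. The following is a minimal sketch of that idea, not the project's actual code: the `Record` field, the `load` method name, and the `DelegatingAdapter` class with its injected `extract_text` callable are all hypothetical, chosen so the example runs without pdfplumber or python-docx installed.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    # Hypothetical single field; the real Record in models/record.py likely carries more.
    text: str


class BaseAdapter(ABC):
    @abstractmethod
    def load(self, source: str) -> List[Record]:
        """Turn an input source into a list of records."""


class TxtAdapter(BaseAdapter):
    def load(self, source: str) -> List[Record]:
        # Here `source` is the raw text itself, to keep the sketch file-free.
        return [Record(line) for line in source.splitlines() if line.strip()]


class DelegatingAdapter(BaseAdapter):
    """Stands in for a PDF or DOCX adapter: extract text, then delegate."""

    def __init__(self, extract_text: Callable[[str], str]):
        # extract_text is an assumed hook; a real adapter might call
        # pdfplumber or python-docx at this point instead.
        self._extract_text = extract_text
        self._txt = TxtAdapter()

    def load(self, source: str) -> List[Record]:
        return self._txt.load(self._extract_text(source))
```

With this shape, supporting a new input format only needs a new text extractor; the message parsing logic stays in one place.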

Requirements

  • Python 3.8+
  • No dependencies required for .txt parsing (stdlib only)
  • Optional dependencies for other formats:

pip install -r requirements.txt

| Package | Required for |
|---|---|
| pdfplumber | .pdf input files |
| python-docx | .docx input files |

Usage

python3 main.py --input <path-to-chat-export> --output <output.md>

All Options

| Flag | Default | Description |
|---|---|---|
| --input | (required) | Path to the input file (.txt, .pdf, .docx) |
| --output | problems.md | Path for the output Markdown file |
| --from-date | 2025-01-01 | Filter messages from this date (YYYY-MM-DD) |
| --to-date | 2026-12-31 | Filter messages up to this date (YYYY-MM-DD) |
| --threshold | 5 | Minimum score for a message to be included |
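
For readers who want to see how these options could map onto Python's argparse, here is a rough sketch. The flag names and defaults come from the table above; the `build_parser` function name, the help strings, and the choice to parse dates with `date.fromisoformat` and the threshold as an integer are assumptions, and main.py may be organized differently.

```python
import argparse
from datetime import date


def build_parser():
    # Hypothetical helper; flags and defaults mirror the options listed above.
    parser = argparse.ArgumentParser(description="Extract tagged chat content to Markdown")
    parser.add_argument("--input", required=True,
                        help="Path to the input file (.txt, .pdf, .docx)")
    parser.add_argument("--output", default="problems.md",
                        help="Path for the output Markdown file")
    # date.fromisoformat accepts YYYY-MM-DD and raises ValueError otherwise,
    # which argparse reports as a usage error.
    parser.add_argument("--from-date", type=date.fromisoformat, default=date(2025, 1, 1))
    parser.add_argument("--to-date", type=date.fromisoformat, default=date(2026, 12, 31))
    # Assumed to be an integer; the real CLI might accept a float.
    parser.add_argument("--threshold", type=int, default=5)
    return parser
```

Calling `build_parser().parse_args(["--input", "chat.txt"])` would then yield the defaults for every other option.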

Examples

Basic usage:

python3 main.py --input Whatsapp_Document.txt --output problems.md

With date range:

python3 main.py --input Whatsapp_Document.txt --output problems.md \
                --from-date 2025-06-01 --to-date 2025-12-31

Lower threshold (capture more messages):

python3 main.py --input Whatsapp_Document.txt --output problems.md --threshold 3

How It Works

The pipeline has four stages:

Adapter → TemporalValidator → FeatureClassifier → MarkdownExporter

  1. Ingestion — The adapter reads the file line by line, buffers multi-line messages, and separates system messages from user messages. It handles the WhatsApp timestamp format, including Unicode whitespace.

  2. Temporal Validation — Parses DD/MM/YY dates, expands 2-digit years (25 → 2025), and filters out messages outside the configured date range.

  3. Classification — Each message is scored using weighted regex signals. Messages that meet the threshold are tagged with a day number and type. System messages are discarded before scoring.

  4. Export — Matched messages are grouped by day number, sorted, and written to Markdown with WhatsApp *bold* converted to Markdown **bold**.
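
The first two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the export line layout (`DD/MM/YY, time - Sender: text`), the `LINE_RE` pattern, the dictionary message shape, and the function names `parse_messages` and `in_range` are all hypothetical. Lines without a `Sender:` prefix are treated here as system messages, and 2-digit years are assumed to belong to the 2000s.

```python
import re
from datetime import date

# Assumed export layout: "30/06/25, 9:15 pm - Sender: text".
# In Python 3 str patterns, \s also matches Unicode whitespace such as U+202F.
LINE_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4}|\d{2}),\s+.+?\s+-\s+(.*)$")


def parse_messages(lines):
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            day, month, year = int(match[1]), int(match[2]), int(match[3])
            if year < 100:
                year += 2000  # 2-digit year expansion, e.g. 25 -> 2025
            sender, sep, text = match[4].partition(": ")
            if not sep:
                # No "Sender: " prefix, so treat the line as a system message.
                sender, text = None, match[4]
            messages.append({"date": date(year, month, day),
                             "sender": sender, "text": text})
        elif messages:
            # No timestamp: this line continues the previous message.
            messages[-1]["text"] += "\n" + line
    return messages


def in_range(messages, start, end):
    # Keep only messages whose date lies within [start, end].
    return [msg for msg in messages if start <= msg["date"] <= end]
```

A line whose date fields do not form a valid DD/MM date would raise `ValueError` in this sketch; a fuller implementation would probably want to handle that case explicitly.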

Scoring Signals (configurable in config.py)

| Pattern | Weight |
|---|---|
| Day \d+ | 5 |
| Test Case | 3 |
| Revision for the day | 3 |
| Input: / Output: | 2 each |
| DS/Algo keywords (Array, Stack, Graph, etc.) | 1 each |

Default threshold: 5 (a message with just Day 36 already qualifies).
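
To make the arithmetic concrete, the scoring could look something like the sketch below. The patterns and weights are taken from the signals listed above; everything else is an assumption: the function names `score` and `is_relevant`, case-sensitive matching, counting each signal at most once per message, and the reading of "each" as "per distinct pattern". Only three of the DS/Algo keywords are shown.

```python
import re

# Pattern/weight pairs from the README; the data structure itself is assumed.
POSITIVE_SIGNALS = [
    (re.compile(r"Day \d+"), 5),
    (re.compile(r"Test Case"), 3),
    (re.compile(r"Revision for the day"), 3),
    (re.compile(r"Input:"), 2),
    (re.compile(r"Output:"), 2),
    (re.compile(r"\bArray\b"), 1),
    (re.compile(r"\bStack\b"), 1),
    (re.compile(r"\bGraph\b"), 1),
]
SCORE_THRESHOLD = 5


def score(text, signals=POSITIVE_SIGNALS):
    # Assumption: each signal contributes its weight once, however often it matches.
    return sum(weight for pattern, weight in signals if pattern.search(text))


def is_relevant(text, threshold=SCORE_THRESHOLD):
    return score(text) >= threshold


print(score("Day 36"))                        # 5, meets the default threshold
print(score("Input: 7824\nOutput: seven"))    # 4, falls just short
```

Under these assumptions, a message containing only `Input:` and `Output:` would be dropped at the default threshold but kept with `--threshold 3`.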


Output Format

# Day 36 — Problem Solving For The Day
**Date:** 2025-06-30
**Sender:** xyz Sir

## Problem
Given a number, convert it into the form of words...

## Test Cases
**Test Case 1:**
- Input: `7824`
- Output: `seven thousand eight hundred twenty four`
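
As an illustration of the export step, the sketch below groups messages by day number, sorts them, and converts WhatsApp `*bold*` to Markdown `**bold**`. It is deliberately simpler than the format shown above (no Problem or Test Cases subsections), and the names `BOLD_RE`, `to_markdown_bold`, and `export_markdown`, as well as the message dictionary keys, are hypothetical.

```python
import re
from collections import defaultdict

# Matches *text* where the asterisks hug the text and are not already doubled.
BOLD_RE = re.compile(r"(?<!\*)\*(?!\s)([^*\n]+?)(?<!\s)\*(?!\*)")


def to_markdown_bold(text):
    return BOLD_RE.sub(r"**\1**", text)


def export_markdown(messages):
    # Group by the (assumed) "day" key, then emit sections in ascending day order.
    by_day = defaultdict(list)
    for msg in messages:
        by_day[msg["day"]].append(msg)

    sections = []
    for day in sorted(by_day):
        for msg in by_day[day]:
            sections.append(
                f"# Day {day}\n"
                f"**Date:** {msg['date']}\n"
                f"**Sender:** {msg['sender']}\n\n"
                f"{to_markdown_bold(msg['text'])}\n"
            )
    return "\n".join(sections)
```

The lookarounds in `BOLD_RE` are there so that text which is already `**bold**`, or a stray asterisk surrounded by spaces, is left untouched.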

Customization

All tunable values live in config.py:

  • SYSTEM_KEYWORDS — strings that identify non-user system messages (e.g. "joined using", "created group")
  • POSITIVE_SIGNALS — regex/weight pairs used for scoring
  • SCORE_THRESHOLD — minimum score to include a message
  • DEFAULT_FROM_DATE / DEFAULT_TO_DATE — default date range

To adapt this for a different chat format or content type, update these values — no code changes needed.
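
As one example of what such a change might look like, the sketch below filters system messages using a keyword list. The two default keywords are the examples given above; the `is_system_message` helper and its case-insensitive substring matching are assumptions about how the list could be applied, not a description of the actual implementation.

```python
# Example keywords from the README; the real list in config.py may be longer.
SYSTEM_KEYWORDS = ["joined using", "created group"]


def is_system_message(text, keywords=SYSTEM_KEYWORDS):
    # Assumed matching rule: case-insensitive substring search.
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


# Adapting to noisier exports would then only require extending the list.
extended = SYSTEM_KEYWORDS + ["end-to-end", "disappearing"]
print(is_system_message("Messages are end-to-end encrypted", extended))  # True
```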


Verification

After running, you can sanity-check the output:

# Count how many Day entries were extracted
grep -c "^# Day" problems.md

# Spot-check a specific day
grep -A 20 "^# Day 36" problems.md

# Confirm no system noise leaked through
grep "joined using\|end-to-end\|disappearing" problems.md

Privacy

This tool runs entirely offline. Your chat data is never sent anywhere — all parsing and classification happens locally using Python's standard library and optional local packages.

About

A privacy-focused, local Python framework capable of ingesting unstructured or semi-structured documents to filter, categorize, and export specific data segments based on temporal (date) and semantic (content) rules.
