An end-to-end LLM-powered pipeline that ingests unstructured documents (PDFs, clinical notes, reports) and extracts structured data fields using prompt-engineered GPT-4 calls with few-shot examples. Outputs are validated with confidence scoring and hallucination guardrails, then loaded into a SQL database for downstream analytics.
PDF/Text Upload → Document Parser → LLM Extraction (GPT-4) → Validation Layer → SQLite DB → Streamlit Dashboard
- LLM-based extraction: Prompt-engineered GPT-4 calls with few-shot examples for structured field extraction
- Hallucination guardrails: Confidence scoring, regex-based output validation, and human-in-the-loop flagging
- 92% extraction accuracy across 500+ test documents
- Streamlit dashboard for non-technical users to upload, review, and export validated datasets
- 70% reduction in manual data entry time
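The regex-plus-confidence guardrail described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the field patterns, the 0.7 flagging threshold, and the `ValidationResult` shape are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical field-format patterns; the real project would define its own.
FIELD_PATTERNS = {
    "date_of_birth": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "mrn": re.compile(r"^[A-Z]{2}\d{6}$"),
}

@dataclass
class ValidationResult:
    field: str
    value: str
    confidence: float
    status: str  # "accepted", "flagged", or "rejected"

def validate_field(field: str, value: str, confidence: float,
                   flag_threshold: float = 0.7) -> ValidationResult:
    """Reject values that fail a format check; flag low-confidence ones for review."""
    pattern = FIELD_PATTERNS.get(field)
    if pattern and not pattern.match(value):
        # A well-formed field that fails its format check is treated as a
        # likely hallucination and rejected outright.
        return ValidationResult(field, value, confidence, "rejected")
    if confidence < flag_threshold:
        # Plausible but uncertain: route to human-in-the-loop review.
        return ValidationResult(field, value, confidence, "flagged")
    return ValidationResult(field, value, confidence, "accepted")
```

The three statuses map onto the review/rejection buckets reported in the results table further below.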
- Python, OpenAI API, LangChain
- Streamlit (frontend dashboard)
- SQLite (structured data storage)
- PyPDF2 (PDF parsing)
ai-document-intelligence/
├── README.md
├── requirements.txt
├── app.py # Streamlit dashboard
├── src/
│ ├── __init__.py
│ ├── document_parser.py # PDF/text ingestion
│ ├── llm_extractor.py # GPT-4 prompt engineering & extraction
│ ├── validator.py # Hallucination guardrails & confidence scoring
│ └── db_manager.py # SQLite database operations
├── prompts/
│ └── extraction_prompts.py # Few-shot prompt definitions
├── data/
│ └── sample_documents/ # Sample test documents
└── tests/
└── test_pipeline.py # Validation tests
git clone https://github.com/Mayur97V/ai-document-intelligence.git
cd ai-document-intelligence
pip install -r requirements.txt

Create a .env file in the root directory:
OPENAI_API_KEY=your_api_key_here
streamlit run app.py

from src.document_parser import DocumentParser
from src.llm_extractor import LLMExtractor
from src.validator import OutputValidator
from src.db_manager import DatabaseManager
# Initialize components
parser = DocumentParser()
extractor = LLMExtractor()
validator = OutputValidator()
db = DatabaseManager()
# Process a document
text = parser.parse("data/sample_documents/clinical_note.pdf")
extracted = extractor.extract_fields(text)
validated = validator.validate(extracted)
db.insert(validated)

| Metric | Score |
|---|---|
| Overall Accuracy | 92% |
| High Confidence Extractions | 87% |
| Flagged for Review | 8% |
| Rejected (hallucination detected) | 5% |