PRD: Multi-Document Private RAG Chatbot (Phase 1) #2
Description
Problem Statement
Users who have a collection of private documents (PDFs, Word documents, text files, Markdown) across a folder structure need a way to ask natural language questions and get accurate, cited answers drawn from across all their documents. Existing solutions either require uploading documents to cloud services (privacy concern), only handle one document at a time, or use basic retrieval that misses relevant information.
The user has built a single-document RAG chatbot previously but needs a system that: queries across hundreds of documents simultaneously, runs entirely locally for privacy, uses modern retrieval techniques for better answer quality, and provides clear source attribution so answers can be verified.
Solution
A locally-hosted conversational RAG chatbot called Multi Doc Query (MDQ) that:
- Ingests an entire folder of documents (PDF, DOCX, TXT, MD) into a local vector database
- Uses hybrid search (BM25 keyword + semantic vector search) with reciprocal rank fusion for significantly better retrieval than pure semantic search
- Reranks results with a cross-encoder model for precision
- Generates streaming answers via a local LLM (Ollama) with inline citations pointing to specific documents and pages
- Handles follow-up questions by condensing them into standalone queries
- Shows retrieval steps transparently so the user can see what the system is doing
- Detects and highlights conflicting information across documents
- Runs completely offline — no data leaves the machine
User Stories
- As a user, I want to point the app at a folder of documents and have them all ingested, so that I can query across my entire document collection
- As a user, I want the app to recursively scan subfolders by default, so that my existing folder organisation is respected without manual flattening
- As a user, I want to toggle recursive scanning on or off, so that I can control the scope of ingestion
- As a user, I want to see a progress indicator during document ingestion, so that I know the system is working and how long it might take
- As a user, I want to see how many documents were successfully ingested and how many failed, so that I know the state of my index
- As a user, I want failed documents to be skipped without blocking the rest, so that one corrupted file doesn't prevent me from querying everything else
- As a user, I want to see which documents failed and why, so that I can fix issues if needed
- As a user, I want the app to detect new or changed documents on startup, so that my index stays current without manual intervention
- As a user, I want a manual re-ingest button, so that I can force a refresh when needed
- As a user, I want to ask natural language questions and get accurate answers drawn from all my documents, so that I can find information without reading every document myself
- As a user, I want answers to include inline citations like [filename.pdf, p. 12], so that I can verify the answer against the original source
- As a user, I want to expand and read the actual source chunks that informed an answer, so that I can see the evidence in context
- As a user, I want source chunks ordered by relevance, so that the most important evidence appears first
- As a user, I want the system to highlight when documents contain conflicting information, so that I'm aware of discrepancies rather than getting a misleading single answer
- As a user, I want to ask follow-up questions like "tell me more about that" or "which document mentioned the deadline?", so that I can have a natural conversation rather than crafting perfect standalone queries
- As a user, I want to see answers stream word-by-word as they're generated, so that I get immediate feedback rather than waiting for the complete response
- As a user, I want to see the retrieval steps (searching, reranking, generating) in a collapsible display, so that I can understand what the system is doing without it cluttering the interface
- As a user, I want the app to check that Ollama is running and required models are available on startup, so that I get a clear error message with instructions rather than a cryptic failure
- As a user, I want a welcome message when no documents are indexed, so that I know how to get started
- As a user, I want a clear message if I try to ask a question before ingesting documents, so that I understand why no answer was provided
- As a user, I want a status indicator showing how many documents are indexed, so that I can confirm the system is ready
- As a user, I want all processing to happen locally on my machine, so that my private documents never leave my control
- As a user, I want PDF page numbers in citations, so that I can find the exact page in the original document
- As a user, I want Markdown section headers in citations, so that I can navigate to the right section in the original file
- As a user, I want the document's relative path from the configured folder used in citations (e.g., "tax/2024/guidance.pdf"), so that I can locate the file in my folder structure
- As a user, I want the app to launch with a single command, so that getting started is simple
- As a user, I want to configure the document folder path through the UI settings, so that I don't need to edit config files for basic usage
- As a user, I want developer settings (model names, chunk sizes, retrieval parameters) in a config file, so that I can tune the system without changing code
Implementation Decisions
Architecture
- 10 deep modules with simple, testable interfaces organised into three pipeline stages: ingestion, retrieval, generation
- Pipeline composition happens in the Chainlit app handler which wires modules together
- LangChain used selectively — only for document loading and text splitting where it provides genuine value. All other components (embeddings, vector store, BM25, fusion, reranking, prompts, LLM calls, session memory) use direct library/API calls. Document in code comments where LangChain is used and where it could be but isn't.
Models (all configurable via config.yaml)
- LLM: llama3.1:8b via Ollama (fast responses, upgrade path to 70B+ if quality needs it)
- Embeddings: mxbai-embed-large via Ollama (1024 dims, MTEB 64.68)
- Reranker: bge-reranker-v2-m3 via sentence-transformers (568M params, NDCG@10 46.0)
Document Ingestion
- Formats: PDF (PyPDFLoader), DOCX (Docx2txtLoader), TXT (TextLoader), MD (TextLoader) — all via LangChain
- Chunking: 512 tokens with 100 token overlap. RecursiveCharacterTextSplitter for PDF/DOCX/TXT. MarkdownHeaderTextSplitter for .md files (preserves section headers as metadata)
- Storage: Single ChromaDB collection with rich metadata per chunk: filename, doc_type, page_number, section_header, chunk_index, doc_hash
- ChromaDB persist path: ~/.query_doc/chroma_db/ (outside OneDrive to avoid cloud sync issues)
- Change detection: MD5 hash of file contents compared against stored doc_hash metadata
- Failure handling: Skip failed documents, continue processing, log errors, summarise at end
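The change-detection rule above (MD5 of file contents compared against stored doc_hash) can be sketched as follows. The function names are illustrative, not the project's actual API:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """MD5 of the file's raw bytes — stored as doc_hash chunk metadata."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def needs_ingestion(path: Path, known_hashes: set[str]) -> bool:
    """A file is (re-)ingested when its current hash is not already indexed."""
    return file_hash(path) not in known_hashes
```

On startup, `known_hashes` would be collected from the doc_hash metadata of all chunks in ChromaDB, so new and changed files are picked up while unchanged ones are skipped.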
Retrieval Pipeline
Query → BM25 (top 20) + Semantic (top 20) → RRF fusion (k=60) → top 30 → rerank → top 10 → LLM
- Hybrid search: BM25 (rank_bm25) + ChromaDB semantic search
- Fusion: Custom Reciprocal Rank Fusion implementation (~15 lines, no framework)
- Reranking: bge-reranker-v2-m3 cross-encoder scoring
- All retrieval parameters are developer config only (not exposed in UI)
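The custom RRF step really is only a few lines. A sketch with the stated k=60, assuming each input is a ranked list of chunk IDs (best first):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

In the pipeline above this would be called as `rrf_fuse([bm25_ids, semantic_ids])[:30]` before reranking. A document ranked well by both retrievers accumulates two reciprocal-rank contributions and rises above documents that only one retriever found.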
BM25 Index
- Rebuilt from ChromaDB chunk texts on startup (single source of truth)
- Expected rebuild time: 1-3 seconds for up to 50,000 chunks
- Fall back to pickling if measured performance is unacceptable
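The project uses rank_bm25, but the rebuild-on-startup idea is just "tokenise every chunk text and score queries against the collection". A minimal pure-Python sketch of Okapi BM25 scoring (standard k1=1.5, b=0.75) to make the mechanics concrete — not the library's implementation:

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document in `corpus` against `query` using Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: how many docs contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

The rebuild is linear in total corpus tokens, which is why 50,000 chunks fit comfortably in the 1–3 second startup budget.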
Session Memory
- Condense-then-retrieve: follow-up questions rewritten into standalone questions using chat history via a small LLM call before retrieval
- First/independent questions skip condensation
- Chat history does not consume the answer generation context window
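Condensation is one small LLM call; the deterministic part worth sketching is the prompt construction and the skip-on-first-question rule. The prompt wording below is illustrative, not the project's actual template:

```python
def build_condense_prompt(question: str,
                          chat_history: list[tuple[str, str]]) -> str:
    """Build a prompt asking the LLM to rewrite a follow-up as standalone.
    chat_history is a list of (user_message, assistant_answer) pairs."""
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history)
    return (
        "Given the conversation below, rewrite the final user question as a "
        "standalone question that can be understood without the conversation.\n\n"
        f"{history}\n\nFollow-up question: {question}\n\nStandalone question:"
    )

def maybe_condense(question: str, chat_history: list[tuple[str, str]], llm) -> str:
    """First/independent questions skip condensation entirely."""
    if not chat_history:
        return question
    return llm(build_condense_prompt(question, chat_history))
```

Because only the condensed standalone question is sent to retrieval and answer generation, chat history never competes with retrieved chunks for the generation context window.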
Prompt Template
- Generic persona (no domain framing — documents could be about anything)
- Instructs LLM to: cite sources, highlight conflicts across documents, say clearly when no relevant excerpts found
- Context formatted with source headers per chunk: "--- Source: filename.pdf | Page 12 ---"
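The per-chunk source-header format can be sketched like this, assuming chunks carry the metadata fields listed under Document Ingestion:

```python
def format_context(chunks: list[dict]) -> str:
    """Join chunk texts with '--- Source: ... ---' headers for the prompt."""
    parts = []
    for c in chunks:
        header = f"--- Source: {c['filename']}"
        if c.get("page_number") is not None:
            header += f" | Page {c['page_number']}"      # PDFs
        if c.get("section_header"):
            header += f" | Section: {c['section_header']}"  # Markdown
        header += " ---"
        parts.append(f"{header}\n{c['text']}")
    return "\n\n".join(parts)
```

Keeping the header format in one place matters because the citation instruction in the prompt ("cite as [filename.pdf, p. 12]") must match what the LLM actually sees in the context.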
UI (Chainlit)
- Streaming responses
- Inline source elements (Chainlit Elements) — clickable, expandable
- Collapsible retrieval step display (collapsed by default)
- Settings panel (gear icon) for folder path and recursive toggle
- Action button for manual re-ingest
- Default Chainlit theme
- Startup health checks with actionable error messages
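The startup check talks to Ollama's local HTTP API (GET /api/tags lists installed models). The deterministic part — comparing that response against the models config.yaml requires — might look like this; the function name is illustrative:

```python
def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Compare Ollama's /api/tags payload against the required model list.

    tags_response has the shape {"models": [{"name": "llama3.1:8b"}, ...]}.
    Returns the required models that are not installed, so the startup
    error message can tell the user exactly what to `ollama pull`.
    """
    installed = {m["name"] for m in tags_response.get("models", [])}
    return [m for m in required if m not in installed]
```

Keeping this logic pure (dict in, list out) is also what makes the HealthCheck tests below straightforward: the HTTP call is mocked and only this comparison is asserted.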
Modules
- DocumentLoader — load_folder(path, recursive) → list[Document]
- Chunker — chunk_documents(documents) → list[Chunk]
- VectorStore — add_chunks(chunks), search(query, k), get_all_texts(), has_document(doc_hash)
- BM25Index — build(texts), search(query, k)
- HybridRetriever — retrieve(query) → list[ScoredChunk] (orchestrates BM25 + semantic + RRF)
- Reranker — rerank(query, chunks, top_k) → list[ScoredChunk]
- Condenser — condense(question, chat_history) → standalone_question
- Answerer — answer(question, chunks) → stream[str]
- Config — load_config(path) → AppConfig
- HealthCheck — check_ollama(), check_models(required)
Configuration
- Single config.yaml for all developer settings (models, chunk sizes, retrieval params, paths)
- No .env file needed (no API keys — fully local system)
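A config.yaml along these lines would cover the settings named above. The keys are illustrative; the actual schema is whatever AppConfig defines:

```yaml
models:
  llm: llama3.1:8b
  embeddings: mxbai-embed-large
  reranker: bge-reranker-v2-m3

chunking:
  chunk_size: 512        # tokens
  chunk_overlap: 100

retrieval:
  bm25_top_k: 20
  semantic_top_k: 20
  rrf_k: 60
  fused_top_k: 30
  rerank_top_k: 10

paths:
  chroma_db: ~/.query_doc/chroma_db
```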
Tooling
- Package manager: uv + pyproject.toml
- Launch command: uv run chainlit run app.py
- Git: Feature branches per issue, PRs to main
Testing Decisions
What makes a good test
- Tests verify external behaviour through the module's public interface, not implementation details
- Tests should be deterministic — no dependency on LLM output
- Tests use realistic but minimal fixtures (small test documents, small corpora)
- Integration tests use in-memory ChromaDB to avoid filesystem side effects
Modules with tests (all deterministic modules)
- DocumentLoader — feed known test files (one of each format), assert correct Document objects with correct metadata. Test failure handling with corrupted files.
- Chunker — feed known documents, assert chunk count, chunk sizes within bounds, overlap present, metadata preserved. Test format-specific splitting (MD section headers).
- VectorStore — use in-memory ChromaDB. Test add/search/get_all_texts/has_document. Test deduplication via doc_hash.
- BM25Index — build from known corpus, assert expected documents returned for known queries. Test empty corpus edge case.
- HybridRetriever — mock BM25Index and VectorStore with known ranked lists, assert RRF fusion produces correct merged ordering.
- Reranker — feed known query + chunks, assert output is reordered (top result changes from input order). Test top_k parameter respected.
- Config — feed valid YAML, assert correct AppConfig. Feed invalid/missing YAML, assert appropriate errors. Test default values.
- HealthCheck — mock HTTP responses from Ollama, assert correct status and error messages for: Ollama running, Ollama down, models present, models missing.
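As one concrete instance of this testing style — deterministic, behaviour through the public interface, minimal fixture — here is a Chunker-shaped test. The naive character splitter is a self-contained stand-in for the real Chunker, which splits on tokens via RecursiveCharacterTextSplitter:

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Stand-in splitter: fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def test_chunk_sizes_and_overlap():
    text = "abcdefghij" * 50  # 500 chars
    chunks = split_with_overlap(text, size=128, overlap=32)
    # every chunk is within the size bound
    assert all(len(c) <= 128 for c in chunks)
    # consecutive chunks share the configured overlap
    for a, b in zip(chunks, chunks[1:]):
        assert a[-32:] == b[:32]
```

The test asserts bounds and overlap rather than exact chunk contents, so it survives internal changes to the splitting strategy — the "external behaviour, not implementation details" rule above.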
Not tested (LLM-dependent)
- Condenser and Answerer — depend on non-deterministic LLM output. Prompt construction could be tested but deferred to avoid brittle tests. Output quality deferred to RAGAS evaluation in Phase 2.
Out of Scope
The following are explicitly deferred to Phase 2 or Phase 3:
- Agentic retrieval — LLM dynamically choosing search strategy (Phase 2)
- Query transformation — HyDE, query decomposition, multi-query (Phase 2)
- Metadata filtering in UI — filtering by document type, date, name (Phase 2)
- Incremental ingestion optimisation — only re-process changed chunks within a document (Phase 2)
- RAGAS evaluation dashboard — systematic quality measurement (Phase 2)
- Parent-child chunking — hierarchical retrieval with small match / large context (Phase 3)
- Document management UI — add/remove individual documents, view ingestion status per document (Phase 3)
- Graph RAG — knowledge graphs for entity-relationship queries (Phase 3)
- Multi-modal — images and tables within documents (Phase 3)
- Custom theming — branded UI appearance (not planned)
- Multi-user support — authentication, per-user document sets (not planned)
- Cloud deployment — this is a local-first, privacy-focused application
Further Notes
- This project serves dual purposes: (1) a genuinely useful private document query tool, and (2) a trial of Matt Pocock's PRD-driven AI development methodology
- The user has 128GB RAM on Mac Studio M4 Max — hardware is not a constraint for any model choice
- The existing doc_query project (single-document RAG) serves as a reference implementation for patterns like cross-encoder reranking and source citation
- All three model choices (LLM, embeddings, reranker) are configurable via config.yaml to allow experimentation without code changes
- LangChain usage boundaries should be documented in code comments to maintain awareness of framework coupling
- The /prd-to-issues step should break this into vertical slices aligned with the module structure