PRD: Multi-Document Private RAG Chatbot (Phase 1) #2
Description
Problem Statement
Users who have a collection of private documents (PDFs, Word documents, text files, Markdown) across a folder structure need a way to ask natural language questions and get accurate, cited answers drawn from across all their documents. Existing solutions either require uploading documents to cloud services (privacy concern), only handle one document at a time, or use basic retrieval that misses relevant information.
The user has built a single-document RAG chatbot previously but needs a system that: queries across hundreds of documents simultaneously, runs entirely locally for privacy, uses modern retrieval techniques for better answer quality, and provides clear source attribution so answers can be verified.
Solution
A locally-hosted conversational RAG chatbot called Multi Doc Query (MDQ) that:
- Ingests an entire folder of documents (PDF, DOCX, TXT, MD) into a local vector database
- Uses hybrid search (BM25 keyword + semantic vector search) with reciprocal rank fusion for significantly better retrieval than pure semantic search
- Reranks results with a cross-encoder model for precision
- Generates streaming answers via a local LLM (Ollama) with inline citations pointing to specific documents and pages
- Handles follow-up questions by condensing them into standalone queries
- Shows retrieval steps transparently so the user can see what the system is doing
- Detects and highlights conflicting information across documents
- Runs completely offline — no data leaves the machine
User Stories
- As a user, I want to point the app at a folder of documents and have them all ingested, so that I can query across my entire document collection
- As a user, I want the app to recursively scan subfolders by default, so that my existing folder organisation is respected without manual flattening
- As a user, I want to toggle recursive scanning on or off, so that I can control the scope of ingestion
- As a user, I want to see a progress indicator during document ingestion, so that I know the system is working and how long it might take
- As a user, I want to see how many documents were successfully ingested and how many failed, so that I know the state of my index
- As a user, I want failed documents to be skipped without blocking the rest, so that one corrupted file doesn't prevent me from querying everything else
- As a user, I want to see which documents failed and why, so that I can fix issues if needed
- As a user, I want the app to detect new or changed documents on startup, so that my index stays current without manual intervention
- As a user, I want a manual re-ingest button, so that I can force a refresh when needed
- As a user, I want to ask natural language questions and get accurate answers drawn from all my documents, so that I can find information without reading every document myself
- As a user, I want answers to include inline citations like [filename.pdf, p. 12], so that I can verify the answer against the original source
- As a user, I want to expand and read the actual source chunks that informed an answer, so that I can see the evidence in context
- As a user, I want source chunks ordered by relevance, so that the most important evidence appears first
- As a user, I want the system to highlight when documents contain conflicting information, so that I'm aware of discrepancies rather than getting a misleading single answer
- As a user, I want to ask follow-up questions like "tell me more about that" or "which document mentioned the deadline?", so that I can have a natural conversation rather than crafting perfect standalone queries
- As a user, I want to see answers stream word-by-word as they're generated, so that I get immediate feedback rather than waiting for the complete response
- As a user, I want to see the retrieval steps (searching, reranking, generating) in a collapsible display, so that I can understand what the system is doing without it cluttering the interface
- As a user, I want the app to check that Ollama is running and required models are available on startup, so that I get a clear error message with instructions rather than a cryptic failure
- As a user, I want a welcome message when no documents are indexed, so that I know how to get started
- As a user, I want a clear message if I try to ask a question before ingesting documents, so that I understand why no answer was provided
- As a user, I want a status indicator showing how many documents are indexed, so that I can confirm the system is ready
- As a user, I want all processing to happen locally on my machine, so that my private documents never leave my control
- As a user, I want PDF page numbers in citations, so that I can find the exact page in the original document
- As a user, I want Markdown section headers in citations, so that I can navigate to the right section in the original file
- As a user, I want the document's relative path from the configured folder used in citations (e.g., "tax/2024/guidance.pdf"), so that I can locate the file in my folder structure
- As a user, I want the app to launch with a single command, so that getting started is simple
- As a user, I want to configure the document folder path through the UI settings, so that I don't need to edit config files for basic usage
- As a user, I want developer settings (model names, chunk sizes, retrieval parameters) in a config file, so that I can tune the system without changing code
Implementation Decisions
Architecture
- 10 deep modules with simple, testable interfaces organised into three pipeline stages: ingestion, retrieval, generation
- Pipeline composition happens in the Chainlit app handler which wires modules together
- LangChain used selectively — only for document loading and text splitting where it provides genuine value. All other components (embeddings, vector store, BM25, fusion, reranking, prompts, LLM calls, session memory) use direct library/API calls. Document in code comments where LangChain is used and where it could be but isn't.
Models (all configurable via config.yaml)
- LLM: llama3.1:8b via Ollama (fast responses, upgrade path to 70B+ if quality needs it)
- Embeddings: mxbai-embed-large via Ollama (1024 dims, MTEB 64.68)
- Reranker: bge-reranker-v2-m3 via sentence-transformers (568M params, NDCG@10 46.0)
Document Ingestion
- Formats: PDF (PyPDFLoader), DOCX (Docx2txtLoader), TXT (TextLoader), MD (TextLoader) — all via LangChain
- Chunking: 512 tokens with 100 token overlap. RecursiveCharacterTextSplitter for PDF/DOCX/TXT. MarkdownHeaderTextSplitter for .md files (preserves section headers as metadata)
- Storage: Single ChromaDB collection with rich metadata per chunk: filename, doc_type, page_number, section_header, chunk_index, doc_hash
- ChromaDB persist path: ~/.query_doc/chroma_db/ (outside OneDrive to avoid cloud sync issues)
- Change detection: MD5 hash of file contents compared against stored doc_hash metadata
- Failure handling: Skip failed documents, continue processing, log errors, summarise at end
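The change-detection rule above (MD5 of file contents compared against stored doc_hash) can be sketched as follows. The function names are illustrative, not the project's actual API:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """MD5 of the file's raw bytes — stored as doc_hash chunk metadata."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def needs_ingestion(path: Path, known_hashes: set[str]) -> bool:
    """A file is (re-)ingested when its current hash is not already indexed."""
    return file_hash(path) not in known_hashes
```

On startup, `known_hashes` would be collected from the doc_hash metadata of all chunks in ChromaDB, so new and changed files are picked up while unchanged ones are skipped.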
Retrieval Pipeline
Query → BM25 (top 20) + Semantic (top 20) → RRF fusion (k=60) → top 30 → rerank → top 10 → LLM
- Hybrid search: BM25 (rank_bm25) + ChromaDB semantic search
- Fusion: Custom Reciprocal Rank Fusion implementation (~15 lines, no framework)
- Reranking: bge-reranker-v2-m3 cross-encoder scoring
- All retrieval parameters are developer config only (not exposed in UI)
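The custom RRF step really is only a few lines. A sketch with the stated k=60, assuming each input is a ranked list of chunk IDs (best first):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

In the pipeline above this would be called as `rrf_fuse([bm25_ids, semantic_ids])[:30]` before reranking. A document ranked well by both retrievers accumulates two reciprocal-rank contributions and rises above documents that only one retriever found.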
BM25 Index
- Rebuilt from ChromaDB chunk texts on startup (single source of truth)
- Expected rebuild time: 1-3 seconds for up to 50,000 chunks
- Fall back to pickling if measured performance is unacceptable
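The project uses rank_bm25, but the rebuild-on-startup idea is just "tokenise every chunk text and score queries against the collection". A minimal pure-Python sketch of Okapi BM25 scoring (standard k1=1.5, b=0.75) to make the mechanics concrete — not the library's implementation:

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document in `corpus` against `query` using Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency: how many docs contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```

The rebuild is linear in total corpus tokens, which is why 50,000 chunks fit comfortably in the 1–3 second startup budget.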
Session Memory
- Condense-then-retrieve: follow-up questions rewritten into standalone questions using chat history via a small LLM call before retrieval
- First/independent questions skip condensation
- Chat history does not consume the answer generation context window
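Condensation is one small LLM call; the deterministic part worth sketching is the prompt construction and the skip-on-first-question rule. The prompt wording below is illustrative, not the project's actual template:

```python
def build_condense_prompt(question: str,
                          chat_history: list[tuple[str, str]]) -> str:
    """Build a prompt asking the LLM to rewrite a follow-up as standalone.
    chat_history is a list of (user_message, assistant_answer) pairs."""
    history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history)
    return (
        "Given the conversation below, rewrite the final user question as a "
        "standalone question that can be understood without the conversation.\n\n"
        f"{history}\n\nFollow-up question: {question}\n\nStandalone question:"
    )

def maybe_condense(question: str, chat_history: list[tuple[str, str]], llm) -> str:
    """First/independent questions skip condensation entirely."""
    if not chat_history:
        return question
    return llm(build_condense_prompt(question, chat_history))
```

Because only the condensed standalone question is sent to retrieval and answer generation, chat history never competes with retrieved chunks for the generation context window.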
Prompt Template
- Generic persona (no domain framing — documents could be about anything)
- Instructs LLM to: cite sources, highlight conflicts across documents, say clearly when no relevant excerpts found
- Context formatted with source headers per chunk: "--- Source: filename.pdf | Page 12 ---"
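The per-chunk source-header format can be sketched like this, assuming chunks carry the metadata fields listed under Document Ingestion:

```python
def format_context(chunks: list[dict]) -> str:
    """Join chunk texts with '--- Source: ... ---' headers for the prompt."""
    parts = []
    for c in chunks:
        header = f"--- Source: {c['filename']}"
        if c.get("page_number") is not None:
            header += f" | Page {c['page_number']}"      # PDFs
        if c.get("section_header"):
            header += f" | Section: {c['section_header']}"  # Markdown
        header += " ---"
        parts.append(f"{header}\n{c['text']}")
    return "\n\n".join(parts)
```

Keeping the header format in one place matters because the citation instruction in the prompt ("cite as [filename.pdf, p. 12]") must match what the LLM actually sees in the context.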
UI (Chainlit)
- Streaming responses
- Inline source elements (Chainlit Elements) — clickable, expandable
- Collapsible retrieval step display (collapsed by default)
- Settings panel (gear icon) for folder path and recursive toggle
- Action button for manual re-ingest
- Default Chainlit theme
- Startup health checks with actionable error messages
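The startup check talks to Ollama's local HTTP API (GET /api/tags lists installed models). The deterministic part — comparing that response against the models config.yaml requires — might look like this; the function name is illustrative:

```python
def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Compare Ollama's /api/tags payload against the required model list.

    tags_response has the shape {"models": [{"name": "llama3.1:8b"}, ...]}.
    Returns the required models that are not installed, so the startup
    error message can tell the user exactly what to `ollama pull`.
    """
    installed = {m["name"] for m in tags_response.get("models", [])}
    return [m for m in required if m not in installed]
```

Keeping this logic pure (dict in, list out) is also what makes the HealthCheck tests below straightforward: the HTTP call is mocked and only this comparison is asserted.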
Modules
- DocumentLoader — load_folder(path, recursive) → list[Document]
- Chunker — chunk_documents(documents) → list[Chunk]
- VectorStore — add_chunks(chunks), search(query, k), get_all_texts(), has_document(doc_hash)
- BM25Index — build(texts), search(query, k)
- HybridRetriever — retrieve(query) → list[ScoredChunk] (orchestrates BM25 + semantic + RRF)
- Reranker — rerank(query, chunks, top_k) → list[ScoredChunk]
- Condenser — condense(question, chat_history) → standalone_question
- Answerer — answer(question, chunks) → stream[str]
- Config — load_config(path) → AppConfig
- HealthCheck — check_ollama(), check_models(required)
Configuration
- Single config.yaml for all developer settings (models, chunk sizes, retrieval params, paths)
- No .env file needed (no API keys — fully local system)
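A config.yaml along these lines would cover the settings named above. The keys are illustrative; the actual schema is whatever AppConfig defines:

```yaml
models:
  llm: llama3.1:8b
  embeddings: mxbai-embed-large
  reranker: bge-reranker-v2-m3

chunking:
  chunk_size: 512        # tokens
  chunk_overlap: 100

retrieval:
  bm25_top_k: 20
  semantic_top_k: 20
  rrf_k: 60
  fused_top_k: 30
  rerank_top_k: 10

paths:
  chroma_db: ~/.query_doc/chroma_db
```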
Tooling
- Package manager: uv + pyproject.toml
- Launch command: uv run chainlit run app.py
- Git: Feature branches per issue, PRs to main
Testing Decisions
What makes a good test
- Tests verify external behaviour through the module's public interface, not implementation details
- Tests should be deterministic — no dependency on LLM output
- Tests use realistic but minimal fixtures (small test documents, small corpora)
- Integration tests use in-memory ChromaDB to avoid filesystem side effects
Modules with tests (all deterministic modules)
- DocumentLoader — feed known test files (one of each format), assert correct Document objects with correct metadata. Test failure handling with corrupted files.
- Chunker — feed known documents, assert chunk count, chunk sizes within bounds, overlap present, metadata preserved. Test format-specific splitting (MD section headers).
- VectorStore — use in-memory ChromaDB. Test add/search/get_all_texts/has_document. Test deduplication via doc_hash.
- BM25Index — build from known corpus, assert expected documents returned for known queries. Test empty corpus edge case.
- HybridRetriever — mock BM25Index and VectorStore with known ranked lists, assert RRF fusion produces correct merged ordering.
- Reranker — feed known query + chunks, assert output is reordered (top result changes from input order). Test top_k parameter respected.
- Config — feed valid YAML, assert correct AppConfig. Feed invalid/missing YAML, assert appropriate errors. Test default values.
- HealthCheck — mock HTTP responses from Ollama, assert correct status and error messages for: Ollama running, Ollama down, models present, models missing.
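As one concrete instance of this testing style — deterministic, behaviour through the public interface, minimal fixture — here is a Chunker-shaped test. The naive character splitter is a self-contained stand-in for the real Chunker, which splits on tokens via RecursiveCharacterTextSplitter:

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Stand-in splitter: fixed-size character windows with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def test_chunk_sizes_and_overlap():
    text = "abcdefghij" * 50  # 500 chars
    chunks = split_with_overlap(text, size=128, overlap=32)
    # every chunk is within the size bound
    assert all(len(c) <= 128 for c in chunks)
    # consecutive chunks share the configured overlap
    for a, b in zip(chunks, chunks[1:]):
        assert a[-32:] == b[:32]
```

The test asserts bounds and overlap rather than exact chunk contents, so it survives internal changes to the splitting strategy — the "external behaviour, not implementation details" rule above.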
Not tested (LLM-dependent)
- Condenser and Answerer — depend on non-deterministic LLM output. Prompt construction could be tested but deferred to avoid brittle tests. Output quality deferred to RAGAS evaluation in Phase 2.
Out of Scope
The following are explicitly deferred to Phase 2 or Phase 3:
- Agentic retrieval — LLM dynamically choosing search strategy (Phase 2)
- Query transformation — HyDE, query decomposition, multi-query (Phase 2)
- Metadata filtering in UI — filtering by document type, date, name (Phase 2)
- Incremental ingestion optimisation — only re-process changed chunks within a document (Phase 2)
- RAGAS evaluation dashboard — systematic quality measurement (Phase 2)
- Parent-child chunking — hierarchical retrieval with small match / large context (Phase 3)
- Document management UI — add/remove individual documents, view ingestion status per document (Phase 3)
- Graph RAG — knowledge graphs for entity-relationship queries (Phase 3)
- Multi-modal — images and tables within documents (Phase 3)
- Custom theming — branded UI appearance (not planned)
- Multi-user support — authentication, per-user document sets (not planned)
- Cloud deployment — this is a local-first, privacy-focused application
Further Notes
- This project serves dual purposes: (1) a genuinely useful private document query tool, and (2) a trial of Matt Pocock's PRD-driven AI development methodology
- The user has 128GB RAM on Mac Studio M4 Max — hardware is not a constraint for any model choice
- The existing doc_query project (single-document RAG) serves as a reference implementation for patterns like cross-encoder reranking and source citation
- All three model choices (LLM, embeddings, reranker) are configurable via config.yaml to allow experimentation without code changes
- LangChain usage boundaries should be documented in code comments to maintain awareness of framework coupling
- The /prd-to-issues step should break this into vertical slices aligned with the module structure