SA-RAG (Semantic-Accelerated Retrieval Augmentation) is a high-performance RAG system designed for enterprise knowledge bases. It combines Rust's performance with Python's flexibility to provide a complete RAG solution.
┌─────────────────────────────────────────────────────────────┐
│ Python Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ RAG │ │ Client │ │ LLM │ │Embedding │ │
│ │ Pipeline │ │ API │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └─────────────┴──────────────┴──────────────┘ │
│ │ │
│ Orchestrator │
└─────────────────────────┼─────────────────────────────────────┘
│ PyO3
▼
┌─────────────────────────────────────────────────────────────┐
│ Rust Core │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Parser │ │ Indexer │ │ Graph │ │ Memory │ │
│ │ │ │ │ │ │ │ │ │
│ │ Semantic │ │ Vector │ │ Adjacency│ │ Ebbinghaus│ │
│ │ Nodes │ │ (HNSW) │ │ List │ │ Curve │ │
│ │ │ │ + BM25 │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Differential Indexing │ │
│ │ (Document Version Management) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Purpose: Automatically segment documents into hierarchical semantic nodes.
Algorithm:
- Identifies headings (Markdown style
#,##, or numbered sections) - Splits text into paragraphs, sentences, and chunks
- Builds parent-child relationships
- Creates
SemanticNodestructures with metadata
Key Features:
- Supports both English and Chinese headings
- Maintains document hierarchy
- Preserves text spans for efficient retrieval
Purpose: High-performance approximate nearest neighbor search.
Algorithm: HNSW (Hierarchical Navigable Small World)
Features:
- Multi-layer graph structure
- Greedy search from top layer
- Dynamic insertion with neighbor selection
- Cosine similarity for vector comparison
Parameters:
m: Maximum connections per layer (default: 16)m_max: Controls layer sparsity (default: 64)ef_construction: Search width during insertion
Purpose: Full-text search with relevance scoring.
Algorithm: BM25 ranking function
Features:
- Tokenization and stemming (English)
- Inverted index for fast lookup
- IDF (Inverse Document Frequency) calculation
- TF (Term Frequency) weighting
Formula:
BM25(q, d) = Σ IDF(qi) * (f(qi, d) * (k1 + 1)) / (f(qi, d) + k1 * (1 - b + b * |d| / avgdl))
Purpose: Combine vector and BM25 search results.
Algorithm: Reciprocal Rank Fusion (RRF)
Formula:
RRF(d) = Σ 1 / (k + rank_i(d))
Features:
- Configurable weights for vector and BM25
- Automatic score normalization
- Top-k result selection
Purpose: Store and query semantic relationships.
Structure: Adjacency list representation
Edge Types:
ParentChild: Hierarchical relationshipsSemanticSimilarity: Similarity-based connectionsMention: Reference relationshipsNextChunk: Sequential relationshipsReference: Cross-referencesSynonym: Synonym relationships
Features:
- Weighted edges (0.0 - 1.0)
- Edge metadata support
- Bidirectional edge support
- Graph statistics
Purpose: Expand retrieval results through graph traversal.
Strategies:
- Basic Expansion: Breadth-first expansion
- Weighted Expansion: Filter by edge weight threshold
- Type-aware Expansion: Prioritize specific edge types
- Diverse Expansion: Control expansion diversity
- Smart Expansion: Combined strategy with decay
Algorithm:
- BFS traversal from seed nodes
- Score decay with depth
- Type-based filtering
- Weight threshold filtering
Purpose: Store and retrieve long-term memories with temporal decay.
Algorithm: Ebbinghaus Forgetting Curve
Formula:
retention = exp(-decay_rate * time_passed / importance_factor)
score = importance * retention * relevance
Features:
- Importance-weighted decay
- Access frequency boost (spaced repetition)
- Keyword and vector-based retrieval
- Automatic memory pruning
Purpose: Incremental document updates without full re-indexing.
Algorithm: Simplified Myers diff algorithm
Features:
- Document version management
- Changed segment identification
- Partial embedding regeneration
- Version history tracking
Purpose: Orchestrate the complete RAG workflow.
Workflow:
- Document indexing with embedding generation
- Query rewriting
- Hybrid retrieval (vector + BM25)
- Graph expansion (optional)
- Memory retrieval (optional)
- Result fusion
- Answer generation with LLM
Purpose: Interface with various LLM providers.
Supported Providers:
- OpenAI (GPT models)
- DeepSeek
- Local models (via transformers)
- Mock mode (for testing)
Features:
- Chat completion
- RAG-enhanced generation
- Context management
- Temperature control
Purpose: Generate text embeddings.
Supported Providers:
- OpenAI embeddings
- DeepSeek embeddings
- Local models (sentence-transformers)
- Mock mode (deterministic)
Features:
- Batch embedding generation
- Vector normalization
- Cosine similarity calculation
Purpose: Coordinate query processing and result fusion.
Features:
- Query rewriting (rule-based or LLM-based)
- Retrieval strategy planning
- Result fusion (RRF, weighted, max)
- Graph expansion decision making
Purpose: High-level API for end users.
Features:
- Simplified interface
- Statistics tracking
- Error handling
- Configuration management
Document Text
│
▼
Python: RAG.index_documents()
│
▼
Rust: RustCoreEngine.index_documents()
│
├─► Parser: Parse into SemanticNodes
│ │
│ ├─► Create nodes with hierarchy
│ └─► Build parent-child relationships
│
├─► Graph: Add nodes and edges
│ │
│ └─► Create ParentChild edges
│
├─► BM25: Index text content
│ │
│ └─► Build inverted index
│
└─► Python: Generate embeddings
│
└─► Rust: Update vector index (HNSW)
User Query
│
▼
Python: RAG.ask()
│
├─► Orchestrator: Rewrite query
│
├─► Embedding: Generate query embedding
│
├─► Rust: Hybrid search
│ │
│ ├─► Vector search (HNSW)
│ └─► BM25 search
│
├─► Rust: Graph expansion (optional)
│ │
│ └─► Expand seed nodes
│
├─► Rust: Memory retrieval (optional)
│
├─► Orchestrator: Fuse results (RRF)
│
└─► LLM: Generate answer
│
└─► Return to user
- Throughput: ~1000 documents/second (with embeddings)
- Memory: O(n) where n is number of nodes
- Disk: Optional persistence (not implemented)
- Vector Search: O(log n) with HNSW
- BM25 Search: O(m) where m is query term count
- Graph Expansion: O(k * d) where k is seed nodes, d is depth
- Total Latency: <50ms for typical queries
- Storage: O(n) for n memories
- Retrieval: O(n) with pruning optimization
- Decay Calculation: O(1) per memory
- Rust Core: Thread-safe with
Arc<Mutex<>>for shared state - Python Layer: GIL-bound (single-threaded Python execution)
- Future: Consider async/await for I/O-bound operations
- Implement in Rust core
- Expose via PyO3
- Integrate in
HybridIndex - Add to orchestrator
- Implement in
LLMService - Add provider-specific logic
- Update configuration
- Implement in
graph/expansion.rs - Add to
GraphExpansionstruct - Expose via PyO3
- Persistence: Add disk-based storage for indexes
- Distributed: Support distributed indexing and retrieval
- Async: Async/await support for Python layer
- Caching: Add result caching for common queries
- Monitoring: Add metrics and observability
- Optimization: Further optimize HNSW parameters
- Multi-language: Better support for non-English text