# EdSummariser Utils

Core utilities for the EdSummariser RAG system, providing document processing, retrieval, and AI integration.

## Core Modules

### `rag.py` - Enhanced Retrieval System
- **Multi-strategy search**: Flat, hybrid, Atlas, and local vector search
- **Flat index**: Exhaustive search for maximum accuracy
- **MongoDB integration**: Chunk storage and retrieval with vector embeddings
- **Search types**: `flat`, `hybrid`, `atlas`, `local`

### `chunker.py` - Document Segmentation
- **Semantic chunking**: Heading-based text segmentation
- **Overlap strategy**: 50-word overlap between chunks for context preservation
- **Academic patterns**: Enhanced regex for academic document structures
- **Size control**: 150-500 word chunks with intelligent splitting

### `embeddings.py` - Vector Generation
- **Sentence Transformers**: all-MiniLM-L6-v2 model (384 dimensions)
- **Lazy loading**: Model loaded on first use
- **Fallback support**: Random embeddings when the model is unavailable

### `router.py` - AI Model Routing
- **Multi-provider**: Gemini and NVIDIA API integration
- **Model selection**: Automatic routing based on query complexity
- **Retry logic**: Robust error handling with key rotation

### `parser.py` - Document Processing
- **PDF parsing**: PyMuPDF with image extraction
- **DOCX support**: Microsoft Word document processing
- **Image handling**: PIL integration for document images

## AI Integration

### `summarizer.py` - Content Summarization
- **Cheap summarization**: Lightweight text summarization
- **Content cleaning**: LLM-based chunk text cleaning
- **Topic extraction**: Single-sentence topic generation

### `caption.py` - Image Analysis
- **BLIP integration**: Image captioning for document images
- **Visual context**: Image-to-text conversion for RAG

## Memory & Context

### `memo/` - Memory Management
- **Conversation history**: LRU-based memory system
- **Context retrieval**: Semantic and recent context selection
- **NVIDIA integration**: File relevance classification
- **Session-specific memory**: Isolated memory per chat session
- **Auto-naming**: AI-powered session naming based on first query
- **Memory cleanup**: Session and project-level memory management

## Key Features

### Enhanced RAG Capabilities
- **Chain of Thought**: Query variation generation for better retrieval
- **Multi-query search**: 3-5 query variations per search
- **Smart deduplication**: Result ranking and deduplication
- **Fallback strategies**: 4-tier fallback system for zero results

### Document Processing
- **Academic-aware chunking**: Specialized patterns for academic documents
- **Context preservation**: Overlapping chunks maintain document flow
- **Metadata extraction**: Page spans, topics, and summaries

### Performance Optimizations
- **Lazy loading**: Models loaded only when needed
- **Caching**: API key rotation and retry mechanisms
- **Sampling**: Intelligent document sampling for large datasets

## R&D Areas

### Short-term Improvements
- **Query expansion**: More sophisticated query reformulation
- **Reranking**: Cross-encoder models for result reranking
- **Metadata filtering**: Enhanced metadata-based search

### Long-term Enhancements
- **TreeRAG**: Hierarchical document organization
- **Hybrid retrieval**: Sparse + dense retrieval combination
- **Fine-tuning**: Domain-specific embedding models
- **Evaluation framework**: Retrieval accuracy metrics

## Maintenance

### Dependencies
- **Core**: `sentence-transformers`, `pymongo`, `numpy`
- **PDF**: `PyMuPDF`, `PIL`
- **AI**: `httpx` for API calls
- **Optional**: `weasyprint` for PDF generation

### Configuration
- **Environment variables**: API keys, model names, search preferences
- **MongoDB**: Vector index configuration for Atlas
- **Model settings**: Embedding dimensions, chunk sizes

### Monitoring
- **Logging**: Comprehensive logging across all modules
- **Error handling**: Graceful degradation and fallbacks
- **Performance**: Search strategy selection based on results

## Usage
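Documents are segmented before retrieval. The 50-word overlap strategy described under `chunker.py` might look roughly like this (an illustrative sketch only — the real chunker also performs heading-based semantic segmentation and enforces a 150-word minimum; `chunk_words` is a hypothetical name, not the module's actual API):

```python
def chunk_words(text, max_words=500, overlap=50):
    """Split text into chunks of up to max_words words, where each
    chunk repeats the previous chunk's last `overlap` words so that
    context is preserved across chunk boundaries."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back 50 words for the next chunk
    return chunks
```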
```python
# Basic RAG usage
from utils.rag.rag import RAGStore
from utils.rag.embeddings import EmbeddingClient

rag = RAGStore(mongo_uri, db_name)
embedder = EmbeddingClient()

# Enhanced search
hits = rag.vector_search(
    user_id, project_id, query_vector,
    k=6, search_type="flat"
)
```

## File Structure

```
utils/
├── rag.py           # Core retrieval system
├── chunker.py       # Document segmentation
├── embeddings.py    # Vector generation
├── router.py        # AI model routing
├── parser.py        # Document parsing
├── summarizer.py    # Content summarization
├── caption.py       # Image analysis
├── common.py        # Shared utilities
├── logger.py        # Logging configuration
└── memo/            # Memory management
    ├── core.py      # Memory system core
    ├── history.py   # Conversation history
    ├── nvidia.py    # NVIDIA integration
    └── session.py   # Session-specific memory management
```
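The multi-query search with smart deduplication listed under Key Features could be approximated as follows (a sketch under assumptions: `merge_hits` is a hypothetical helper, and the `{"id", "score"}` dict shape is illustrative rather than the actual hit format returned by `rag.vector_search`):

```python
def merge_hits(hit_lists, k=6):
    """Merge hits from several query variations: deduplicate by chunk
    id, keep each chunk's best score, then re-rank and take the top k."""
    best = {}
    for hits in hit_lists:
        for hit in hits:
            current = best.get(hit["id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["id"]] = hit  # keep the highest-scoring copy
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:k]
```

Running each of the 3-5 query variations through `vector_search` and passing the resulting lists to a merger like this yields a single deduplicated, ranked result set.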