msse-ai-engineering / project-plan.md

Seth McKnight

Add memory diagnostics endpoints and logging enhancements (#80)

0a7f9b4 about 2 months ago

	# RAG Application Project Plan

	This plan outlines the steps to design, build, and deploy a Retrieval-Augmented Generation (RAG) application per the project requirements, targeting a grade of 5. The approach prioritizes early deployment and continuous integration and follows Test-Driven Development (TDD) principles.

	## 1. Foundational Setup

	- [x] Repository: Create a new GitHub repository.
	- [x] Virtual Environment: Set up a local Python virtual environment (`venv`).
	- [x] Initial Files:
	  - Create `requirements.txt` with initial dependencies (`Flask`, `pytest`).
	  - Create a `.gitignore` file for Python.
	  - Create a `README.md` with initial setup instructions.
	  - Create placeholder files: `deployed.md` and `design-and-evaluation.md`.
	- [x] Testing Framework: Establish a `tests/` directory and configure `pytest`.

	## 2. "Hello World" Deployment

	- [x] Minimal App: Develop a minimal Flask application (`app.py`) with a `/health` endpoint that returns a JSON status object.
	- [x] Unit Test: Write a test for the `/health` endpoint to ensure it returns a `200 OK` status and the correct JSON payload.
	- [x] Local Validation: Run the app and tests locally to confirm everything works.
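
The minimal app and its unit test might look roughly like the sketch below. The `{"status": "ok"}` payload is an assumption; the plan only requires a JSON status object and a `200 OK` response.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")
def health():
    # Payload shape is hypothetical; any small JSON status object would do.
    return jsonify({"status": "ok"}), 200


def test_health_returns_ok():
    """pytest-style test using Flask's built-in test client."""
    client = app.test_client()
    response = client.get("/health")
    assert response.status_code == 200
    assert response.get_json() == {"status": "ok"}
```

Saved as `app.py`, the test function can be collected by `pytest` directly or moved into the `tests/` directory.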

	## 3. CI/CD and Initial Deployment

	- [x] Render Setup: Create a new Web Service on Render and link it to the GitHub repository.
	- [x] Environment Configuration: Configure necessary environment variables on Render (e.g., `PYTHON_VERSION`).
	- [x] GitHub Actions: Create a CI/CD workflow (`.github/workflows/main.yml`) that:
	  - Triggers on push/PR to the `main` branch.
	  - Installs dependencies from `requirements.txt`.
	  - Runs the `pytest` test suite.
	  - On success, triggers a deployment to Render.
	- [x] Deployment Validation: Push a change and verify that the workflow runs successfully and the application is deployed.
	- [ ] Documentation: Update `deployed.md` with the live URL of the deployed application.

	### CI/CD optimizations added

	- [x] Add pip cache to CI to speed up dependency installation.
	- [x] Optimize pre-commit in PRs to run hooks only on changed files (use `pre-commit run --from-ref ... --to-ref ...`).

	## 4. Data Ingestion and Processing

	- [x] Corpus Assembly: Collect or generate 5-20 policy documents (PDF, TXT, MD) and place them in a `synthetic_policies/` directory.
	- [x] Parsing Logic: Implement and test functions to parse different document formats.
	- [x] Chunking Strategy: Implement and test a document chunking strategy (e.g., recursive character splitting with overlap).
	- [x] Reproducibility: Set fixed seeds for any processes involving randomness (e.g., chunking, sampling) to ensure deterministic outcomes.
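
As an illustration of chunking with overlap, here is a simplified fixed-window splitter. It is not the recursive character splitter mentioned above, and the function name and default sizes are assumptions. It involves no randomness, so the same input always yields the same chunks.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `overlap` characters of the previous one, which keeps a sentence that straddles a boundary retrievable from at least one chunk.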

	## 5. Embedding and Vector Storage ✅ PHASE 2B COMPLETED

	- [x] Vector DB Setup: Integrate a vector database (ChromaDB) into the project.
	- [x] Embedding Model: Select and integrate a free embedding model (`paraphrase-MiniLM-L3-v2` chosen for memory efficiency).
	- [x] Ingestion Pipeline: Create an enhanced ingestion pipeline that:
	  - Loads documents from the corpus.
	  - Chunks the documents with metadata.
	  - Embeds the chunks using sentence-transformers.
	  - Stores the embeddings in the ChromaDB vector database.
	  - Provides detailed processing statistics.
	- [x] Testing: Write comprehensive tests (60+ tests) verifying each step of the ingestion pipeline.
	- [x] Search API: Implement POST `/search` endpoint for semantic search with:
	  - JSON request/response format
	  - Configurable parameters (top_k, threshold)
	  - Comprehensive input validation
	  - Detailed error handling
	- [x] End-to-End Testing: Complete pipeline testing from ingestion through search.
	- [x] Documentation: Full API documentation with examples and performance metrics.
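
Input validation for the `/search` body could be sketched as a pure function like the one below. The `query` field name, the defaults, and the allowed ranges are assumptions; the plan names only `top_k` and `threshold`.

```python
def validate_search_request(payload: object) -> tuple[dict, list[str]]:
    """Return (cleaned_params, errors) for a /search JSON body."""
    if not isinstance(payload, dict):
        return {}, ["request body must be a JSON object"]

    errors: list[str] = []

    query = payload.get("query")  # field name is hypothetical
    if not isinstance(query, str) or not query.strip():
        errors.append("'query' must be a non-empty string")

    top_k = payload.get("top_k", 5)  # default and bounds are hypothetical
    # bool is a subclass of int, so reject it explicitly
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 50:
        errors.append("'top_k' must be an integer between 1 and 50")

    threshold = payload.get("threshold", 0.0)
    if (
        isinstance(threshold, bool)
        or not isinstance(threshold, (int, float))
        or not 0.0 <= threshold <= 1.0
    ):
        errors.append("'threshold' must be a number between 0 and 1")

    if errors:
        return {}, errors
    return {"query": query.strip(), "top_k": top_k, "threshold": float(threshold)}, []
```

Keeping validation separate from the route handler makes it easy to unit test without starting the web server; the handler can return the error list with a `400` status.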

	## 6. RAG Core Implementation ✅ PHASE 3 COMPLETED

	- [x] Retrieval Logic: Implement a function to retrieve the top-k relevant document chunks from the vector store based on a user query.
	- [x] Prompt Engineering: Design a prompt template that injects the retrieved context into the query for the LLM.
	- [x] LLM Integration: Connect to a free-tier LLM (e.g., via OpenRouter or Groq) to generate answers.
	- [x] Basic Guardrails: Implement and test basic guardrails for context validation and response length limits.
	- [x] Enhanced Guardrails (Issue #24): ✅ COMPLETED - Comprehensive guardrails and response quality system:
	  - [x] Content Safety Filtering: PII detection, bias mitigation, inappropriate content filtering
	  - [x] Response Quality Scoring: Multi-dimensional quality assessment (relevance, completeness, coherence, source fidelity)
	  - [x] Source Attribution: Automated citation generation and validation
	  - [x] Error Handling: Circuit breaker patterns and graceful degradation
	  - [x] Configuration System: Flexible thresholds and feature toggles
	  - [x] Testing: 13 comprehensive tests with 100% pass rate
	  - [x] Integration: Enhanced RAG pipeline with backward compatibility
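
The retrieval and prompt-injection steps can be illustrated with a small in-memory sketch. It ranks chunks by cosine similarity in plain Python instead of querying ChromaDB, and the function names and template wording are assumptions, not the project's actual code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, chunks, k=3):
    """Return the k chunk texts most similar to the query, best first.

    `chunks` is a list of (text, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject numbered context chunks into the template so answers can cite them."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Numbering the chunks in the prompt gives the model stable labels to cite, which is one simple way to support the source attribution listed above.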

	## 7. Web Application Completion

	- [x] Chat Interface: ✅ COMPLETED - Implement a simple web chat interface for the `/` endpoint.
	  - [x] Modern Chat UI: Interactive chat interface with real-time messaging
	  - [x] Message History: Conversation display with user and assistant messages
	  - [x] Source Citations: Visual display of source documents and confidence scores
	  - [x] Responsive Design: Mobile-friendly interface with modern styling
	  - [x] Error Handling: Graceful error display and loading states
	  - [x] System Health: Status indicators and health monitoring
	- [x] API Endpoint: Create the `/chat` API endpoint that receives user questions (POST) and returns model-generated answers with citations and snippets.
	- [x] UI/UX: ✅ COMPLETED - Ensure the web interface is clean, user-friendly, and handles loading/error states gracefully.
	- [x] Testing: Write end-to-end tests for the chat functionality.

	## 7.5. Memory Management & Production Optimization ✅ COMPLETED

	- [x] Memory Architecture Redesign: ✅ COMPLETED - Comprehensive memory optimization for cloud deployment:

	  - [x] App Factory Pattern: Migrated from monolithic to factory pattern with lazy loading
	    - Impact: 87% reduction in startup memory (400MB → 50MB)
	    - Benefit: Services initialize only when needed, improving resource efficiency
	  - [x] Embedding Model Optimization: Changed from `all-MiniLM-L6-v2` to `paraphrase-MiniLM-L3-v2`
	    - Memory Savings: 75-85% reduction (550-1000MB → 132MB)
	    - Quality Impact: <5% reduction in similarity scoring (acceptable trade-off)
	    - Deployment Viability: Enables deployment on Render free tier (512MB limit)
	  - [x] Gunicorn Production Configuration: Optimized for memory-constrained environments
	    - Configuration: Single worker, 2 threads, max_requests=50
	    - Memory Control: Prevent memory leaks with automatic worker restart
	    - Performance: Balanced for I/O-bound LLM operations
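
The Gunicorn settings listed above map directly onto a Python config file. A minimal sketch, assuming the settings live in `gunicorn.conf.py` (the project's actual file name and any further settings are not given in this plan):

```python
# gunicorn.conf.py: sketch of the configuration described above
workers = 1        # a single worker process keeps one copy of the app in memory
threads = 2        # two threads let the worker overlap I/O-bound LLM calls
max_requests = 50  # restart the worker after 50 requests to bound slow memory growth
```

Gunicorn reads module-level names from this file, so no function or class wrapper is needed.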

	- [x] Memory Management Utilities: ✅ COMPLETED - Comprehensive memory monitoring and optimization:

	  - [x] MemoryManager Class: Context manager for memory tracking and cleanup
	  - [x] Real-time Monitoring: Memory usage tracking with automatic garbage collection
	  - [x] Memory Statistics: Detailed memory reporting for production monitoring
	  - [x] Error Recovery: Memory-aware error handling with graceful degradation
	  - [x] Health Integration: Memory metrics exposed via `/health` endpoint
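
A context manager for memory tracking and cleanup could be sketched with the standard library as follows. This is an assumption about the design, not the project's `MemoryManager`: it uses `tracemalloc`, which measures Python-level allocations only, not the process's resident memory.

```python
import gc
import tracemalloc


class MemoryManager:
    """Record Python heap usage around a block and run garbage collection on exit."""

    def __init__(self, label: str = "block"):
        self.label = label
        self.start_bytes = 0
        self.end_bytes = 0
        self.peak_bytes = 0
        self._started_tracing = False

    def __enter__(self):
        if not tracemalloc.is_tracing():
            tracemalloc.start()
            self._started_tracing = True
        self.start_bytes, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, exc_type, exc, tb):
        gc.collect()  # reclaim unreachable objects before the final reading
        self.end_bytes, self.peak_bytes = tracemalloc.get_traced_memory()
        if self._started_tracing:
            tracemalloc.stop()
        return False  # never swallow exceptions raised inside the block

    @property
    def delta_bytes(self) -> int:
        """Net change in traced allocations across the block."""
        return self.end_bytes - self.start_bytes


with MemoryManager("demo") as mm:
    data = list(range(100_000))
print(mm.label, mm.delta_bytes, mm.peak_bytes)
```

The recorded `delta_bytes` and `peak_bytes` values are the kind of figures that could be surfaced through the `/health` endpoint.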

	- [x] Database Pre-building Strategy: ✅ COMPLETED - Eliminate deployment memory spikes:

	  - [x] Local Database Building: `build_embeddings.py` script for development
	  - [x] Repository Commitment: Pre-built vector database (25MB) committed to git
	  - [x] Deployment Optimization: Zero embedding generation on production startup
	  - [x] Memory Impact: Avoid 150MB+ memory spikes during embedding generation

	- [x] Production Deployment Optimization: ✅ COMPLETED - Full production readiness:

	  - [x] Memory Profiling: Comprehensive memory usage analysis and optimization
	  - [x] Performance Testing: Load testing with memory constraints validation
	  - [x] Error Handling: Production-grade error recovery for memory pressure
	  - [x] Monitoring Integration: Real-time memory tracking and alerting
	  - [x] Documentation: Complete memory management documentation across all files

	- [x] Testing & Validation: ✅ COMPLETED - Memory-aware testing infrastructure:
	  - [x] Memory Constraint Testing: All 138 tests pass with memory optimizations
	  - [x] Performance Regression Testing: Response time validation maintained
	  - [x] Memory Leak Detection: Long-running tests validate memory stability
	  - [x] Production Simulation: Testing in memory-constrained environments

	## 8. Evaluation

	- [ ] Evaluation Set: Create an evaluation set of 15-30 questions and corresponding "gold" answers covering various policy topics.
	- [ ] Metric Implementation: Develop scripts to calculate:
	  - Answer Quality: Groundedness and Citation Accuracy.
	  - System Metrics: Latency (p50/p95).
	- [ ] Execution: Run the evaluation and record the results.
	- [ ] Documentation: Summarize the evaluation results in `design-and-evaluation.md`.
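
For the latency metric, p50 and p95 can be computed with the standard library. The sketch below assumes the evaluation script times one callable per question; the function names are hypothetical.

```python
import statistics
import time


def measure_latencies(fn, inputs):
    """Call fn on each input and return per-call wall-clock latencies in milliseconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_summary(latencies_ms):
    """Return p50 and p95 of the given latencies (needs at least two samples)."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}
```

With only 15-30 evaluation questions, p95 is determined by the one or two slowest calls, so repeating the run a few times gives a steadier estimate.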

	## 9. Final Documentation and Submission

	- [x] Design Document: ✅ COMPLETED - Complete `design-and-evaluation.md` with comprehensive technical analysis:
	  - [x] Memory Architecture Design: Detailed analysis of memory-constrained architecture decisions
	  - [x] Performance Evaluation: Comprehensive memory usage, response time, and quality metrics
	  - [x] Model Selection Analysis: Embedding model comparison with memory vs quality trade-offs
	  - [x] Production Deployment Evaluation: Platform compatibility and scalability analysis
	  - [x] Design Trade-offs Documentation: Lessons learned and future considerations
	- [x] README: ✅ COMPLETED - Comprehensive documentation with memory management focus:
	  - [x] Memory Management Section: Detailed memory optimization architecture and utilities
	  - [x] Production Configuration: Gunicorn, database pre-building, and deployment strategies
	  - [x] Performance Metrics: Memory usage breakdown and production performance data
	  - [x] Setup Instructions: Memory-aware development and deployment guidelines
	- [x] Deployment Documentation: ✅ COMPLETED - Updated `deployed.md` with production details:
	  - [x] Memory-Optimized Configuration: Production memory profile and optimization results
	  - [x] Performance Metrics: Real-time memory monitoring and capacity analysis
	  - [x] Production Features: Memory management system and error handling documentation
	  - [x] Deployment Pipeline: CI/CD integration with memory validation
	- [x] Contributing Guidelines: ✅ COMPLETED - Updated `CONTRIBUTING.md` with memory-conscious development:
	  - [x] Memory Development Principles: Guidelines for memory-efficient code patterns
	  - [x] Memory Testing Procedures: Development workflow for memory constraint validation
	  - [x] Code Review Guidelines: Memory-focused review checklist and best practices
	  - [x] Production Testing: Memory leak detection and performance validation procedures
	- [ ] Demonstration Video: Record a 5-10 minute screen-share video demonstrating the deployed application, walking through the code architecture, explaining the evaluation results, and showing a successful CI/CD run.
	- [ ] Submission: Share the GitHub repository with the grader and submit the repository and video links.
