# MSSE AI Engineering Project

This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.

## Features

**Current Implementation (Phase 2B):**

- ✅ **Document Ingestion**: Process and chunk corporate policy documents with metadata tracking
- ✅ **Embedding Generation**: Convert text chunks to vector embeddings using sentence-transformers
- ✅ **Vector Storage**: Persistent storage using ChromaDB for similarity search (sketched below)
- ✅ **Semantic Search API**: REST endpoint for finding relevant document chunks
- ✅ **End-to-End Testing**: Comprehensive test suite validating the complete pipeline

**Upcoming (Phase 3):**

- 🚧 **RAG Implementation**: LLM integration for generating contextual responses
- 🚧 **Quality Evaluation**: Metrics and assessment tools for response quality
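The vector-storage feature above rests on ChromaDB. As a rough illustration of the pattern (not the project's actual `vector_db.py` code; the path, collection name, and vectors below are hypothetical):

```python
import chromadb

# Persistent on-disk store; the path and collection name are assumptions.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("policies")

# Store precomputed embeddings alongside the chunk text and its metadata.
collection.add(
    ids=["remote_work_policy_chunk_2"],
    embeddings=[[0.1, 0.2, 0.3]],  # real vectors come from the embedding model
    documents=["Employees may work remotely up to 3 days per week..."],
    metadatas=[{"filename": "remote_work_policy.md", "chunk_index": 2}],
)

# Similarity search: embed the query with the same model, then query the store.
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=5)
print(results["ids"], results["distances"])
```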
## API Documentation

### Document Ingestion

**POST /ingest**

Process and embed documents from the synthetic policies directory.

```bash
curl -X POST http://localhost:5000/ingest \
  -H "Content-Type: application/json" \
  -d '{"store_embeddings": true}'
```

**Response:**

```json
{
  "status": "success",
  "chunks_processed": 98,
  "files_processed": 22,
  "embeddings_stored": 98,
  "processing_time_seconds": 15.3,
  "message": "Successfully processed and embedded 98 chunks"
}
```
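The same ingestion call from Python, e.g. for scripting or notebooks. This is a minimal sketch using the `requests` library (not part of the project code) and assumes the server is running locally:

```python
import requests

# Trigger ingestion and embedding; allow a generous timeout, since
# embedding the full corpus takes several seconds.
resp = requests.post(
    "http://localhost:5000/ingest",
    json={"store_embeddings": True},
    timeout=300,
)
resp.raise_for_status()

body = resp.json()
print(
    f"Processed {body['chunks_processed']} chunks from "
    f"{body['files_processed']} files in {body['processing_time_seconds']}s"
)
```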
### Semantic Search

**POST /search**

Find relevant document chunks using semantic similarity.

```bash
curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the remote work policy?",
    "top_k": 5,
    "threshold": 0.3
  }'
```

**Response:**

```json
{
  "status": "success",
  "query": "What is the remote work policy?",
  "results_count": 3,
  "results": [
    {
      "chunk_id": "remote_work_policy_chunk_2",
      "content": "Employees may work remotely up to 3 days per week...",
      "similarity_score": 0.87,
      "metadata": {
        "filename": "remote_work_policy.md",
        "chunk_index": 2
      }
    }
  ]
}
```

**Parameters:**

- `query` (required): Text query to search for
- `top_k` (optional): Maximum number of results to return (default: 5, max: 20)
- `threshold` (optional): Minimum similarity score threshold (default: 0.3)
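The equivalent search call from Python, again a minimal `requests` sketch that parses the response fields documented above:

```python
import requests

payload = {
    "query": "What is the remote work policy?",
    "top_k": 5,        # capped at 20 by the API
    "threshold": 0.3,  # drop matches below this similarity score
}
resp = requests.post("http://localhost:5000/search", json=payload, timeout=30)
resp.raise_for_status()

for hit in resp.json()["results"]:
    meta = hit["metadata"]
    print(
        f"{hit['similarity_score']:.2f}  "
        f"{meta['filename']} (chunk {meta['chunk_index']})"
    )
```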
## Corpus

The application uses a synthetic corpus of corporate policy documents located in the `synthetic_policies/` directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).

## Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/sethmcknight/msse-ai-engineering.git
   cd msse-ai-engineering
   ```

2. Create and activate a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Running the Application (local)

To run the Flask application locally:

```bash
export FLASK_APP=app.py
flask run
```

The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:

- `GET /` - Basic application info
- `GET /health` - Health check endpoint
- `POST /ingest` - Document ingestion with embedding generation
- `POST /search` - Semantic search for relevant documents

### Quick Start Workflow

1. **Start the application:**

   ```bash
   flask run
   ```

2. **Ingest and embed documents:**

   ```bash
   curl -X POST http://localhost:5000/ingest \
     -H "Content-Type: application/json" \
     -d '{"store_embeddings": true}'
   ```

3. **Search for relevant content:**

   ```bash
   curl -X POST http://localhost:5000/search \
     -H "Content-Type: application/json" \
     -d '{
       "query": "remote work policy",
       "top_k": 3,
       "threshold": 0.3
     }'
   ```

## Architecture

The application follows a modular architecture with clear separation of concerns:
```
├── src/
│   ├── ingestion/                  # Document processing and chunking
│   │   ├── document_parser.py      # File parsing (Markdown, text)
│   │   ├── document_chunker.py     # Text chunking with overlap
│   │   └── ingestion_pipeline.py   # Complete ingestion workflow
│   ├── embedding/                  # Text embedding generation
│   │   └── embedding_service.py    # Sentence-transformer integration
│   ├── vector_store/               # Vector database operations
│   │   └── vector_db.py            # ChromaDB interface
│   ├── search/                     # Semantic search functionality
│   │   └── search_service.py       # Search with similarity scoring
│   └── config.py                   # Application configuration
├── tests/                          # Comprehensive test suite
├── synthetic_policies/             # Corporate policy corpus
└── app.py                          # Flask application entry point
```
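To illustrate the "text chunking with overlap" step that `document_chunker.py` is responsible for, here is a minimal character-based sketch; the actual implementation may split on tokens or sentences instead, and the sizes shown are hypothetical defaults:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap, so content
    that straddles a boundary still appears intact in at least one chunk."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

The overlap trades a slightly larger index for better recall: a sentence cut in half at a hard chunk boundary would otherwise match search queries poorly in both halves.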
## Performance

**Benchmark Results (Phase 2B):**

- **Ingestion Rate**: ~6-8 chunks/second for embedding generation
- **Search Response Time**: < 1 second for semantic queries
- **Database Size**: ~0.05 MB per chunk (including metadata)
- **Memory Usage**: Kept bounded by batch processing in 32-chunk batches (illustrated below)
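The 32-chunk batching maps directly onto the sentence-transformers API; a minimal sketch (the model name is an assumption, not necessarily what `embedding_service.py` uses):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical model choice; the project's embedding service may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Employees may work remotely up to 3 days per week...",
    "Expense reports must be submitted within 30 days...",
]

# encode() processes inputs in batches of 32, keeping peak memory bounded
# regardless of corpus size.
embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (num_chunks, embedding_dim)
```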
## Running Tests

To run the complete test suite:

```bash
pytest
```

**Test Coverage:**

- **Unit Tests**: Individual component testing (embedding, vector store, search, ingestion)
- **Integration Tests**: Component interaction validation
- **End-to-End Tests**: Complete pipeline testing (ingestion → embedding → search)
- **API Tests**: Flask endpoint validation and error handling
- **Performance Tests**: Benchmarking and quality validation

**Test Statistics:**

- 60+ comprehensive tests covering all components
- End-to-end pipeline validation with real data
- Search quality metrics and performance benchmarks
- Complete error handling and edge case coverage
**Key Test Suites:**

```bash
# Run specific test suites
pytest tests/test_embedding/     # Embedding service tests
pytest tests/test_vector_store/  # Vector database tests
pytest tests/test_search/        # Search functionality tests
pytest tests/test_ingestion/     # Document processing tests
pytest tests/test_integration/   # End-to-end pipeline tests
pytest tests/test_app.py         # Flask API tests
```

## Local Development Infrastructure

For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the `dev-tools/` directory:

### Quick Commands (via Makefile)

```bash
make help      # Show all available commands
make format    # Auto-format code (black + isort)
make check     # Check formatting only
make test      # Run test suite
make ci-check  # Full CI/CD pipeline simulation
make clean     # Clean cache files
```

### Full CI/CD Check Before Push

To prevent GitHub Actions failures, always run the full CI check locally:

```bash
make ci-check
```
This runs the complete pipeline (formatting checks, import sorting, linting, and all tests), exactly matching what GitHub Actions will do.
### Development Workflow

```bash
# 1. Make your changes
# 2. Format and check
make format && make ci-check
# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch
```

For detailed information about the development tools, see [`dev-tools/README.md`](./dev-tools/README.md).

## Development Progress

For detailed development progress, implementation decisions, and technical changes, see [`CHANGELOG.md`](./CHANGELOG.md). The changelog provides:

- Chronological development history
- Technical implementation details
- Test results and coverage metrics
- Component integration status
- Performance benchmarks and optimization notes

This helps team members stay aligned on project progress and understand the evolution of the codebase.

## CI/CD and Deployment

This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to `main`, the workflow triggers a Render deploy and runs a post-deploy smoke test against `/health`.
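The exact smoke test lives in the workflow definition; as a rough sketch of the idea in Python (reading the `RENDER_SERVICE_URL` secret described below):

```python
import os
import requests

# Hit the deployed service's health endpoint and fail loudly if it is down.
base_url = os.environ["RENDER_SERVICE_URL"].rstrip("/")
resp = requests.get(f"{base_url}/health", timeout=30)
assert resp.status_code == 200, f"Health check failed: HTTP {resp.status_code}"
print("Smoke test passed:", resp.text)
```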
If you are deploying to Render manually:

- Create a Web Service in Render (Environment: Docker).
- Dockerfile Path: `Dockerfile`
- Build Context: `.`
- Health Check Path: `/health`
- Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)

To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:

- `RENDER_API_KEY` - Render API key
- `RENDER_SERVICE_ID` - Render service id
- `RENDER_SERVICE_URL` - Render public URL (used for smoke tests)

The workflow will create a small `deploy-update-<ts>` branch with an updated `deployed.md` after a successful deploy; that commit is marked with `[skip-deploy]` so merging it will not trigger another deploy.

## Notes

- `run.sh` binds Gunicorn to the `PORT` environment variable so it works on Render.
- The Dockerfile copies only runtime files and uses `.dockerignore` to avoid including development artifacts.
## Next steps

- Add the Phase 3 RAG and quality-evaluation components (with tests). See `project-plan.md` for detailed milestones.
## Developer tooling

To keep the codebase formatted and linted automatically, we use pre-commit hooks.

1. Create and activate your virtualenv (see Setup above).

2. Install developer dependencies:

   ```bash
   pip install -r dev-requirements.txt
   ```

3. Install the hooks (runs once per clone):

   ```bash
   pre-commit install
   ```

4. To run all hooks locally (for example before pushing):

   ```bash
   pre-commit run --all-files
   ```

CI has a dedicated `pre-commit-check` job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.