MSSE AI Engineering Project
This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.
Features
Current Implementation (Phase 2B):
- ✅ Document Ingestion: Process and chunk corporate policy documents with metadata tracking
- ✅ Embedding Generation: Convert text chunks to vector embeddings using sentence-transformers
- ✅ Vector Storage: Persistent storage using ChromaDB for similarity search
- ✅ Semantic Search API: REST endpoint for finding relevant document chunks
- ✅ End-to-End Testing: Comprehensive test suite validating the complete pipeline
Upcoming (Phase 3):
- 🚧 RAG Implementation: LLM integration for generating contextual responses
- 🚧 Quality Evaluation: Metrics and assessment tools for response quality
API Documentation
Document Ingestion
POST /ingest
Process and embed documents from the synthetic policies directory.
curl -X POST http://localhost:5000/ingest \
-H "Content-Type: application/json" \
-d '{"store_embeddings": true}'
Response:
{
"status": "success",
"chunks_processed": 98,
"files_processed": 22,
"embeddings_stored": 98,
"processing_time_seconds": 15.3,
"message": "Successfully processed and embedded 98 chunks"
}
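For programmatic use, the response can be parsed from Python. This sketch consumes the example payload shown above; the field names come from that payload, and the one-line summary format is just an illustration:

```python
import json

# Example /ingest response, as shown above.
response_text = """
{
  "status": "success",
  "chunks_processed": 98,
  "files_processed": 22,
  "embeddings_stored": 98,
  "processing_time_seconds": 15.3,
  "message": "Successfully processed and embedded 98 chunks"
}
"""

def summarize_ingest(payload: str) -> str:
    """Return a one-line summary of an /ingest response."""
    data = json.loads(payload)
    if data["status"] != "success":
        raise RuntimeError(f"ingestion failed: {data.get('message')}")
    return (f"{data['chunks_processed']} chunks from "
            f"{data['files_processed']} files in "
            f"{data['processing_time_seconds']}s")

print(summarize_ingest(response_text))  # 98 chunks from 22 files in 15.3s
```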
Semantic Search
POST /search
Find relevant document chunks using semantic similarity.
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is the remote work policy?",
"top_k": 5,
"threshold": 0.3
}'
Response:
{
"status": "success",
"query": "What is the remote work policy?",
"results_count": 3,
"results": [
{
"chunk_id": "remote_work_policy_chunk_2",
"content": "Employees may work remotely up to 3 days per week...",
"similarity_score": 0.87,
"metadata": {
"filename": "remote_work_policy.md",
"chunk_index": 2
}
}
]
}
Parameters:
- `query` (required): Text query to search for
- `top_k` (optional): Maximum number of results to return (default: 5, max: 20)
- `threshold` (optional): Minimum similarity score threshold (default: 0.3)
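A client can apply the documented defaults and bounds before calling `/search`. The clamping below illustrates the documented limits (default 5, max 20); whether the server enforces them the same way is an assumption:

```python
def build_search_payload(query: str, top_k: int = 5, threshold: float = 0.3) -> dict:
    """Build a /search request body, applying the documented parameter bounds."""
    if not query:
        raise ValueError("query is required")
    top_k = max(1, min(top_k, 20))  # documented maximum is 20
    return {"query": query, "top_k": top_k, "threshold": threshold}

payload = build_search_payload("What is the remote work policy?", top_k=50)
print(payload["top_k"])  # clamped to 20
```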
Corpus
The application uses a synthetic corpus of corporate policy documents located in the synthetic_policies/ directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).
Setup
Clone the repository:
git clone https://github.com/sethmcknight/msse-ai-engineering.git
cd msse-ai-engineering

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install the dependencies:
pip install -r requirements.txt
Running the Application (local)
To run the Flask application locally:
export FLASK_APP=app.py
flask run
The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:
- `GET /` - Basic application info
- `GET /health` - Health check endpoint
- `POST /ingest` - Document ingestion with embedding generation
- `POST /search` - Semantic search for relevant documents
Quick Start Workflow
Start the application:
flask run

Ingest and embed documents:

curl -X POST http://localhost:5000/ingest \
  -H "Content-Type: application/json" \
  -d '{"store_embeddings": true}'

Search for relevant content:

curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "remote work policy", "top_k": 3, "threshold": 0.3}'
Architecture
The application follows a modular architecture with clear separation of concerns:
├── src/
│   ├── ingestion/                  # Document processing and chunking
│   │   ├── document_parser.py      # File parsing (Markdown, text)
│   │   ├── document_chunker.py     # Text chunking with overlap
│   │   └── ingestion_pipeline.py   # Complete ingestion workflow
│   ├── embedding/                  # Text embedding generation
│   │   └── embedding_service.py    # Sentence-transformer integration
│   ├── vector_store/               # Vector database operations
│   │   └── vector_db.py            # ChromaDB interface
│   ├── search/                     # Semantic search functionality
│   │   └── search_service.py       # Search with similarity scoring
│   └── config.py                   # Application configuration
├── tests/                          # Comprehensive test suite
├── synthetic_policies/             # Corporate policy corpus
└── app.py                          # Flask application entry point
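The chunking step in `document_chunker.py` splits text into overlapping windows. The module's actual parameters are not documented here, so the following is a minimal sketch of the overlap idea with illustrative sizes:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so no sentence is cut off without context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "abcdefghij" * 50  # 500 characters of sample text
chunks = chunk_text(text)
print(len(chunks))                         # 3
print(chunks[0][-50:] == chunks[1][:50])   # True: consecutive chunks overlap
```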
Performance
Benchmark Results (Phase 2B):
- Ingestion Rate: ~6-8 chunks/second for embedding generation
- Search Response Time: < 1 second for semantic queries
- Database Size: ~0.05MB per chunk (including metadata)
- Memory Usage: Efficient batch processing with 32-chunk batches
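The batching pattern behind that last figure can be sketched as follows. The batch size of 32 comes from the benchmarks above; the chunk contents are placeholders:

```python
BATCH_SIZE = 32  # batch size noted in the benchmarks above

def batched(items: list, size: int = BATCH_SIZE):
    """Yield successive fixed-size batches, with a smaller final batch if needed."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 98 chunks (as in the ingest example) -> batches of 32, 32, 32, and 2
sizes = [len(batch) for batch in batched([f"chunk {i}" for i in range(98)])]
print(sizes)  # [32, 32, 32, 2]
```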
Running Tests
To run the complete test suite:
pytest
Test Coverage:
- Unit Tests: Individual component testing (embedding, vector store, search, ingestion)
- Integration Tests: Component interaction validation
- End-to-End Tests: Complete pipeline testing (ingestion → embedding → search)
- API Tests: Flask endpoint validation and error handling
- Performance Tests: Benchmarking and quality validation
Test Statistics:
- 60+ comprehensive tests covering all components
- End-to-end pipeline validation with real data
- Search quality metrics and performance benchmarks
- Complete error handling and edge case coverage
Key Test Suites:
# Run specific test suites
pytest tests/test_embedding/ # Embedding service tests
pytest tests/test_vector_store/ # Vector database tests
pytest tests/test_search/ # Search functionality tests
pytest tests/test_ingestion/ # Document processing tests
pytest tests/test_integration/ # End-to-end pipeline tests
pytest tests/test_app.py # Flask API tests
Local Development Infrastructure
For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the dev-tools/ directory:
Quick Commands (via Makefile)
make help # Show all available commands
make format # Auto-format code (black + isort)
make check # Check formatting only
make test # Run test suite
make ci-check # Full CI/CD pipeline simulation
make clean # Clean cache files
Full CI/CD Check Before Push
To prevent GitHub Actions failures, always run the full CI check locally:
make ci-check
This runs the complete pipeline: formatting checks, import sorting, linting, and all tests, exactly matching what GitHub Actions runs.
Development Workflow
# 1. Make your changes
# 2. Format and check
make format && make ci-check
# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch
For detailed information about the development tools, see dev-tools/README.md.
Development Progress
For detailed development progress, implementation decisions, and technical changes, see CHANGELOG.md. The changelog provides:
- Chronological development history
- Technical implementation details
- Test results and coverage metrics
- Component integration status
- Performance benchmarks and optimization notes
This helps team members stay aligned on project progress and understand the evolution of the codebase.
CI/CD and Deployment
This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to main, the workflow triggers a Render deploy and runs a post-deploy smoke test against /health.
If you are deploying to Render manually:
- Create a Web Service in Render (Environment: Docker).
- Dockerfile Path: `Dockerfile`
- Build Context: `.`
- Health Check Path: `/health`
- Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)
To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:
- `RENDER_API_KEY`: Render API key
- `RENDER_SERVICE_ID`: Render service ID
- `RENDER_SERVICE_URL`: Render public URL (used for smoke tests)
The workflow will create a small deploy-update-<ts> branch with an updated deployed.md after a successful deploy; that commit is marked with [skip-deploy] so merging it will not trigger another deploy.
Notes
- `run.sh` binds Gunicorn to the `PORT` environment variable so it works on Render.
- The Dockerfile copies only runtime files and uses `.dockerignore` to avoid including development artifacts.
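As a sketch, a start script that binds Gunicorn to Render's `PORT` typically looks like this (the script contents and the `app:app` module path are assumptions, not the repository's actual `run.sh`):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of run.sh: bind Gunicorn to the PORT Render provides,
# falling back to 5000 for local runs. The app module path is an assumption.
exec gunicorn --bind "0.0.0.0:${PORT:-5000}" app:app
```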
Next steps
- Add ingestion, embedding, and RAG components (with tests). See `project-plan.md` for detailed milestones.
Developer tooling
To keep the codebase formatted and linted automatically, we use pre-commit hooks.
- Create and activate your virtualenv (see Setup above).
- Install developer dependencies:
pip install -r dev-requirements.txt
- Install the hooks (runs once per clone):
pre-commit install
- To run all hooks locally (for example before pushing):
pre-commit run --all-files
CI has a dedicated pre-commit-check job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.