# MSSE AI Engineering Project
This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.
## Features
**Current Implementation (Phase 2B):**
- ✅ **Document Ingestion**: Process and chunk corporate policy documents with metadata tracking
- ✅ **Embedding Generation**: Convert text chunks to vector embeddings using sentence-transformers
- ✅ **Vector Storage**: Persistent storage using ChromaDB for similarity search
- ✅ **Semantic Search API**: REST endpoint for finding relevant document chunks
- ✅ **End-to-End Testing**: Comprehensive test suite validating the complete pipeline
**Upcoming (Phase 3):**
- 🚧 **RAG Implementation**: LLM integration for generating contextual responses
- 🚧 **Quality Evaluation**: Metrics and assessment tools for response quality
## API Documentation
### Document Ingestion
**POST /ingest**
Process and embed documents from the synthetic policies directory.
```bash
curl -X POST http://localhost:5000/ingest \
-H "Content-Type: application/json" \
-d '{"store_embeddings": true}'
```
**Response:**
```json
{
"status": "success",
"chunks_processed": 98,
"files_processed": 22,
"embeddings_stored": 98,
"processing_time_seconds": 15.3,
"message": "Successfully processed and embedded 98 chunks"
}
```
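As a sanity check, the throughput implied by this response can be derived from its fields. The helper below is a hypothetical convenience for scripting against the API, not part of the API itself:

```python
def ingestion_rate(response: dict) -> float:
    """Chunks embedded per second, derived from a parsed /ingest response."""
    return response["chunks_processed"] / response["processing_time_seconds"]

# Using the example response above: 98 chunks in 15.3 seconds
rate = ingestion_rate({"chunks_processed": 98, "processing_time_seconds": 15.3})
print(f"{rate:.1f} chunks/second")  # 6.4, consistent with the Performance section
```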
### Semantic Search
**POST /search**
Find relevant document chunks using semantic similarity.
```bash
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is the remote work policy?",
"top_k": 5,
"threshold": 0.3
}'
```
**Response:**
```json
{
"status": "success",
"query": "What is the remote work policy?",
"results_count": 3,
"results": [
{
"chunk_id": "remote_work_policy_chunk_2",
"content": "Employees may work remotely up to 3 days per week...",
"similarity_score": 0.87,
"metadata": {
"filename": "remote_work_policy.md",
"chunk_index": 2
}
}
]
}
```
**Parameters:**
- `query` (required): Text query to search for
- `top_k` (optional): Maximum number of results to return (default: 5, max: 20)
- `threshold` (optional): Minimum similarity score threshold (default: 0.3)
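For callers scripting against the API in Python rather than curl, a small client-side helper can apply the same defaults and limits. `build_search_payload` is a hypothetical name; the server performs its own validation regardless:

```python
def build_search_payload(query: str, top_k: int = 5, threshold: float = 0.3) -> dict:
    """Build a /search request body, enforcing the documented limits.

    Hypothetical client-side helper: defaults mirror the parameter table
    above (top_k default 5, max 20; threshold default 0.3).
    """
    if not query:
        raise ValueError("query is required")
    return {
        "query": query,
        "top_k": min(max(top_k, 1), 20),  # clamp to the documented max of 20
        "threshold": threshold,
    }

payload = build_search_payload("What is the remote work policy?", top_k=50)
# payload['top_k'] is clamped to 20
```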
## Corpus
The application uses a synthetic corpus of corporate policy documents located in the `synthetic_policies/` directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).
## Setup
1. Clone the repository:
```bash
git clone https://github.com/sethmcknight/msse-ai-engineering.git
cd msse-ai-engineering
```
2. Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
3. Install the dependencies:
```bash
pip install -r requirements.txt
```
## Running the Application (local)
To run the Flask application locally:
```bash
export FLASK_APP=app.py
flask run
```
The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:
- `GET /` - Basic application info
- `GET /health` - Health check endpoint
- `POST /ingest` - Document ingestion with embedding generation
- `POST /search` - Semantic search for relevant documents
### Quick Start Workflow
1. **Start the application:**
```bash
flask run
```
2. **Ingest and embed documents:**
```bash
curl -X POST http://localhost:5000/ingest \
-H "Content-Type: application/json" \
-d '{"store_embeddings": true}'
```
3. **Search for relevant content:**
```bash
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{
"query": "remote work policy",
"top_k": 3,
"threshold": 0.3
}'
```
## Architecture
The application follows a modular architecture with clear separation of concerns:
```
├── src/
│   ├── ingestion/                 # Document processing and chunking
│   │   ├── document_parser.py     # File parsing (Markdown, text)
│   │   ├── document_chunker.py    # Text chunking with overlap
│   │   └── ingestion_pipeline.py  # Complete ingestion workflow
│   ├── embedding/                 # Text embedding generation
│   │   └── embedding_service.py   # Sentence-transformer integration
│   ├── vector_store/              # Vector database operations
│   │   └── vector_db.py           # ChromaDB interface
│   ├── search/                    # Semantic search functionality
│   │   └── search_service.py      # Search with similarity scoring
│   └── config.py                  # Application configuration
├── tests/                         # Comprehensive test suite
├── synthetic_policies/            # Corporate policy corpus
└── app.py                         # Flask application entry point
```
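To make the role of `document_chunker.py` concrete, here is a minimal sketch of chunking with overlap. The character-based sizes are assumptions; the real implementation may use token counts or sentence boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so context at chunk boundaries is not lost.

    Illustrative only: sizes and the splitting strategy are assumptions,
    not the project's actual chunking parameters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    step = chunk_size - overlap  # advance less than a full chunk each time
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Overlap is what lets a sentence straddling a boundary appear whole in at least one chunk, which matters for retrieval quality.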
## Performance
**Benchmark Results (Phase 2B):**
- **Ingestion Rate**: ~6-8 chunks/second for embedding generation
- **Search Response Time**: < 1 second for semantic queries
- **Database Size**: ~0.05 MB per chunk (including metadata)
- **Memory Usage**: kept bounded by processing embeddings in 32-chunk batches
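The 32-chunk batching mentioned above can be sketched as a simple generator. This illustrates the pattern only; it is not the project's actual embedding code:

```python
from typing import Iterator, Sequence

def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield successive fixed-size batches; the final batch may be smaller.

    Batching bounds memory: only `batch_size` chunks are embedded at once.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 98 chunks (as in the example ingest) fall into batches of 32, 32, 32 and 2
sizes = [len(batch) for batch in batched([f"chunk {i}" for i in range(98)])]
```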
## Running Tests
To run the complete test suite:
```bash
pytest
```
**Test Coverage:**
- **Unit Tests**: Individual component testing (embedding, vector store, search, ingestion)
- **Integration Tests**: Component interaction validation
- **End-to-End Tests**: Complete pipeline testing (ingestion → embedding → search)
- **API Tests**: Flask endpoint validation and error handling
- **Performance Tests**: Benchmarking and quality validation
**Test Statistics:**
- 60+ comprehensive tests covering all components
- End-to-end pipeline validation with real data
- Search quality metrics and performance benchmarks
- Complete error handling and edge case coverage
**Key Test Suites:**
```bash
# Run specific test suites
pytest tests/test_embedding/ # Embedding service tests
pytest tests/test_vector_store/ # Vector database tests
pytest tests/test_search/ # Search functionality tests
pytest tests/test_ingestion/ # Document processing tests
pytest tests/test_integration/ # End-to-end pipeline tests
pytest tests/test_app.py # Flask API tests
```
## Local Development Infrastructure
For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the `dev-tools/` directory:
### Quick Commands (via Makefile)
```bash
make help # Show all available commands
make format # Auto-format code (black + isort)
make check # Check formatting only
make test # Run test suite
make ci-check # Full CI/CD pipeline simulation
make clean # Clean cache files
```
### Full CI/CD Check Before Push
To prevent GitHub Actions failures, always run the full CI check locally:
```bash
make ci-check
```
This runs the complete pipeline (formatting checks, import sorting, linting, and all tests), exactly matching what GitHub Actions runs.
### Development Workflow
```bash
# 1. Make your changes
# 2. Format and check
make format && make ci-check
# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch
```
For detailed information about the development tools, see [`dev-tools/README.md`](./dev-tools/README.md).
## Development Progress
For detailed development progress, implementation decisions, and technical changes, see [`CHANGELOG.md`](./CHANGELOG.md). The changelog provides:
- Chronological development history
- Technical implementation details
- Test results and coverage metrics
- Component integration status
- Performance benchmarks and optimization notes
This helps team members stay aligned on project progress and understand the evolution of the codebase.
## CI/CD and Deployment
This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to `main`, the workflow triggers a Render deploy and runs a post-deploy smoke test against `/health`.
If you are deploying to Render manually:
- Create a Web Service in Render (Environment: Docker).
- Dockerfile Path: `Dockerfile`
- Build Context: `.`
- Health Check Path: `/health`
- Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)
To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:
- `RENDER_API_KEY` – Render API key
- `RENDER_SERVICE_ID` – Render service ID
- `RENDER_SERVICE_URL` – Render public URL (used for smoke tests)
The workflow will create a small `deploy-update-<ts>` branch with an updated `deployed.md` after a successful deploy; that commit is marked with `[skip-deploy]` so merging it will not trigger another deploy.
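The post-deploy smoke test amounts to calling `/health` and checking the response. The helper below sketches that check; the `"status"` field and its accepted values are assumptions about the response schema, not confirmed by this README:

```python
def health_ok(status_code: int, body: dict) -> bool:
    """Return True when a /health response looks healthy.

    Sketch only: the "status" field and the accepted values are guesses
    at the endpoint's response schema.
    """
    return status_code == 200 and body.get("status") in ("ok", "healthy")

# In the workflow this would wrap something like:
#   resp = requests.get(f"{RENDER_SERVICE_URL}/health")
#   assert health_ok(resp.status_code, resp.json())
```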
## Notes
- `run.sh` binds Gunicorn to the `PORT` environment variable so it works on Render.
- The Dockerfile copies only runtime files and uses `.dockerignore` to avoid including development artifacts.
## Next steps
- Add ingestion, embedding, and RAG components (with tests). See `project-plan.md` for detailed milestones.
## Developer tooling
To keep the codebase formatted and linted automatically, we use pre-commit hooks.
1. Create and activate your virtualenv (see Setup above).
2. Install developer dependencies:
```bash
pip install -r dev-requirements.txt
```
3. Install the hooks (runs once per clone):
```bash
pre-commit install
```
4. To run all hooks locally (for example before pushing):
```bash
pre-commit run --all-files
```
CI has a dedicated `pre-commit-check` job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.