# MSSE AI Engineering Project
This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.
## Features
**Current Implementation (Phase 2B):**
- ✅ **Document Ingestion**: Process and chunk corporate policy documents with metadata tracking
- ✅ **Embedding Generation**: Convert text chunks to vector embeddings using sentence-transformers
- ✅ **Vector Storage**: Persistent storage using ChromaDB for similarity search
- ✅ **Semantic Search API**: REST endpoint for finding relevant document chunks
- ✅ **End-to-End Testing**: Comprehensive test suite validating the complete pipeline
**Upcoming (Phase 3):**
- 🚧 **RAG Implementation**: LLM integration for generating contextual responses
- 🚧 **Quality Evaluation**: Metrics and assessment tools for response quality
## API Documentation
### Document Ingestion
**POST /ingest**
Process and embed documents from the synthetic policies directory.
```bash
curl -X POST http://localhost:5000/ingest \
-H "Content-Type: application/json" \
-d '{"store_embeddings": true}'
```
**Response:**
```json
{
"status": "success",
"chunks_processed": 98,
"files_processed": 22,
"embeddings_stored": 98,
"processing_time_seconds": 15.3,
"message": "Successfully processed and embedded 98 chunks"
}
```
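As a sanity check, the throughput implied by this response can be derived from its fields. The helper below is a hypothetical convenience for scripting against the API, not part of the API itself:

```python
def ingestion_rate(response: dict) -> float:
    """Chunks embedded per second, derived from a parsed /ingest response."""
    return response["chunks_processed"] / response["processing_time_seconds"]

# Using the example response above: 98 chunks in 15.3 seconds
rate = ingestion_rate({"chunks_processed": 98, "processing_time_seconds": 15.3})
print(f"{rate:.1f} chunks/second")  # 6.4, consistent with the Performance section
```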
### Semantic Search
**POST /search**
Find relevant document chunks using semantic similarity.
```bash
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is the remote work policy?",
"top_k": 5,
"threshold": 0.3
}'
```
**Response:**
```json
{
"status": "success",
"query": "What is the remote work policy?",
"results_count": 3,
"results": [
{
"chunk_id": "remote_work_policy_chunk_2",
"content": "Employees may work remotely up to 3 days per week...",
"similarity_score": 0.87,
"metadata": {
"filename": "remote_work_policy.md",
"chunk_index": 2
}
}
]
}
```
**Parameters:**
- `query` (required): Text query to search for
- `top_k` (optional): Maximum number of results to return (default: 5, max: 20)
- `threshold` (optional): Minimum similarity score threshold (default: 0.3)
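For callers scripting against the API in Python rather than curl, a small client-side helper can apply the same defaults and limits. `build_search_payload` is a hypothetical name; the server performs its own validation regardless:

```python
def build_search_payload(query: str, top_k: int = 5, threshold: float = 0.3) -> dict:
    """Build a /search request body, enforcing the documented limits.

    Hypothetical client-side helper: defaults mirror the parameter table
    above (top_k default 5, max 20; threshold default 0.3).
    """
    if not query:
        raise ValueError("query is required")
    return {
        "query": query,
        "top_k": min(max(top_k, 1), 20),  # clamp to the documented max of 20
        "threshold": threshold,
    }

payload = build_search_payload("What is the remote work policy?", top_k=50)
# payload['top_k'] is clamped to 20
```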
## Corpus
The application uses a synthetic corpus of corporate policy documents located in the `synthetic_policies/` directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).
## Setup
1. Clone the repository:
```bash
git clone https://github.com/sethmcknight/msse-ai-engineering.git
cd msse-ai-engineering
```
2. Create and activate a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
3. Install the dependencies:
```bash
pip install -r requirements.txt
```
## Running the Application (local)
To run the Flask application locally:
```bash
export FLASK_APP=app.py
flask run
```
The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:
- `GET /` - Basic application info
- `GET /health` - Health check endpoint
- `POST /ingest` - Document ingestion with embedding generation
- `POST /search` - Semantic search for relevant documents
### Quick Start Workflow
1. **Start the application:**
```bash
flask run
```
2. **Ingest and embed documents:**
```bash
curl -X POST http://localhost:5000/ingest \
-H "Content-Type: application/json" \
-d '{"store_embeddings": true}'
```
3. **Search for relevant content:**
```bash
curl -X POST http://localhost:5000/search \
-H "Content-Type: application/json" \
-d '{
"query": "remote work policy",
"top_k": 3,
"threshold": 0.3
}'
```
## Architecture
The application follows a modular architecture with clear separation of concerns:
```
├── src/
│   ├── ingestion/                 # Document processing and chunking
│   │   ├── document_parser.py     # File parsing (Markdown, text)
│   │   ├── document_chunker.py    # Text chunking with overlap
│   │   └── ingestion_pipeline.py  # Complete ingestion workflow
│   ├── embedding/                 # Text embedding generation
│   │   └── embedding_service.py   # Sentence-transformer integration
│   ├── vector_store/              # Vector database operations
│   │   └── vector_db.py           # ChromaDB interface
│   ├── search/                    # Semantic search functionality
│   │   └── search_service.py      # Search with similarity scoring
│   └── config.py                  # Application configuration
├── tests/                         # Comprehensive test suite
├── synthetic_policies/            # Corporate policy corpus
└── app.py                         # Flask application entry point
```
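To make the role of `document_chunker.py` concrete, here is a minimal sketch of chunking with overlap. The character-based sizes are assumptions; the real implementation may use token counts or sentence boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `overlap` characters, so context at chunk boundaries is not lost.

    Illustrative only: sizes and the splitting strategy are assumptions,
    not the project's actual chunking parameters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    step = chunk_size - overlap  # advance less than a full chunk each time
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Overlap is what lets a sentence straddling a boundary appear whole in at least one chunk, which matters for retrieval quality.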
## Performance
**Benchmark Results (Phase 2B):**
- **Ingestion Rate**: ~6-8 chunks/second for embedding generation
- **Search Response Time**: < 1 second for semantic queries
- **Database Size**: ~0.05 MB per chunk (including metadata)
- **Memory Usage**: kept bounded by processing embeddings in 32-chunk batches
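The 32-chunk batching mentioned above can be sketched as a simple generator. This illustrates the pattern only; it is not the project's actual embedding code:

```python
from typing import Iterator, Sequence

def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield successive fixed-size batches; the final batch may be smaller.

    Batching bounds memory: only `batch_size` chunks are embedded at once.
    """
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# 98 chunks (as in the example ingest) fall into batches of 32, 32, 32 and 2
sizes = [len(batch) for batch in batched([f"chunk {i}" for i in range(98)])]
```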
## Running Tests
To run the complete test suite:
```bash
pytest
```
**Test Coverage:**
- **Unit Tests**: Individual component testing (embedding, vector store, search, ingestion)
- **Integration Tests**: Component interaction validation
- **End-to-End Tests**: Complete pipeline testing (ingestion → embedding → search)
- **API Tests**: Flask endpoint validation and error handling
- **Performance Tests**: Benchmarking and quality validation
**Test Statistics:**
- 60+ comprehensive tests covering all components
- End-to-end pipeline validation with real data
- Search quality metrics and performance benchmarks
- Complete error handling and edge case coverage
**Key Test Suites:**
```bash
# Run specific test suites
pytest tests/test_embedding/ # Embedding service tests
pytest tests/test_vector_store/ # Vector database tests
pytest tests/test_search/ # Search functionality tests
pytest tests/test_ingestion/ # Document processing tests
pytest tests/test_integration/ # End-to-end pipeline tests
pytest tests/test_app.py # Flask API tests
```
## Local Development Infrastructure
For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the `dev-tools/` directory:
### Quick Commands (via Makefile)
```bash
make help # Show all available commands
make format # Auto-format code (black + isort)
make check # Check formatting only
make test # Run test suite
make ci-check # Full CI/CD pipeline simulation
make clean # Clean cache files
```
### Full CI/CD Check Before Push
To prevent GitHub Actions failures, always run the full CI check locally:
```bash
make ci-check
```
This runs the complete pipeline (formatting checks, import sorting, linting, and all tests), exactly matching what GitHub Actions runs.
### Development Workflow
```bash
# 1. Make your changes
# 2. Format and check
make format && make ci-check
# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch
```
For detailed information about the development tools, see [`dev-tools/README.md`](./dev-tools/README.md).
## Development Progress
For detailed development progress, implementation decisions, and technical changes, see [`CHANGELOG.md`](./CHANGELOG.md). The changelog provides:
- Chronological development history
- Technical implementation details
- Test results and coverage metrics
- Component integration status
- Performance benchmarks and optimization notes
This helps team members stay aligned on project progress and understand the evolution of the codebase.
## CI/CD and Deployment
This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to `main`, the workflow triggers a Render deploy and runs a post-deploy smoke test against `/health`.
If you are deploying to Render manually:
- Create a Web Service in Render (Environment: Docker).
- Dockerfile Path: `Dockerfile`
- Build Context: `.`
- Health Check Path: `/health`
- Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)
To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:
- `RENDER_API_KEY` – Render API key
- `RENDER_SERVICE_ID` – Render service ID
- `RENDER_SERVICE_URL` – Render public URL (used for smoke tests)
The workflow will create a small `deploy-update-<ts>` branch with an updated `deployed.md` after a successful deploy; that commit is marked with `[skip-deploy]` so merging it will not trigger another deploy.
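The post-deploy smoke test amounts to calling `/health` and checking the response. The helper below sketches that check; the `"status"` field and its accepted values are assumptions about the response schema, not confirmed by this README:

```python
def health_ok(status_code: int, body: dict) -> bool:
    """Return True when a /health response looks healthy.

    Sketch only: the "status" field and the accepted values are guesses
    at the endpoint's response schema.
    """
    return status_code == 200 and body.get("status") in ("ok", "healthy")

# In the workflow this would wrap something like:
#   resp = requests.get(f"{RENDER_SERVICE_URL}/health")
#   assert health_ok(resp.status_code, resp.json())
```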
## Notes
- `run.sh` binds Gunicorn to the `PORT` environment variable so it works on Render.
- The Dockerfile copies only runtime files and uses `.dockerignore` to avoid including development artifacts.
## Next steps
- Add ingestion, embedding, and RAG components (with tests). See `project-plan.md` for detailed milestones.
## Developer tooling
To keep the codebase formatted and linted automatically, we use pre-commit hooks.
1. Create and activate your virtualenv (see Setup above).
2. Install developer dependencies:
```bash
pip install -r dev-requirements.txt
```
3. Install the hooks (runs once per clone):
```bash
pre-commit install
```
4. To run all hooks locally (for example before pushing):
```bash
pre-commit run --all-files
```
CI has a dedicated `pre-commit-check` job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.