
MSSE AI Engineering Project

This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.

Features

Current Implementation (Phase 2B):

  • ✅ Document Ingestion: Process and chunk corporate policy documents with metadata tracking
  • ✅ Embedding Generation: Convert text chunks to vector embeddings using sentence-transformers
  • ✅ Vector Storage: Persistent storage using ChromaDB for similarity search
  • ✅ Semantic Search API: REST endpoint for finding relevant document chunks
  • ✅ End-to-End Testing: Comprehensive test suite validating the complete pipeline
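
The chunking step in the ingestion feature can be pictured as a simple sliding window over the text. The sketch below is illustrative only, not the project's actual document_chunker.py; the chunk_size and overlap values are assumptions, and the real chunker may split on tokens or sentences rather than characters.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks (illustrative sketch;
    the project's real chunk sizes and boundaries may differ)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "a" * 1200
chunks = chunk_text(sample, chunk_size=500, overlap=50)
print(len(chunks))     # a 1200-char text yields 3 overlapping chunks
print(len(chunks[0]))  # each chunk is at most 500 characters
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence falling on a chunk boundary is still fully contained in at least one chunk.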

Upcoming (Phase 3):

  • 🚧 RAG Implementation: LLM integration for generating contextual responses
  • 🚧 Quality Evaluation: Metrics and assessment tools for response quality

API Documentation

Document Ingestion

POST /ingest

Process and embed documents from the synthetic_policies/ directory.

curl -X POST http://localhost:5000/ingest \
  -H "Content-Type: application/json" \
  -d '{"store_embeddings": true}'

Response:

{
  "status": "success",
  "chunks_processed": 98,
  "files_processed": 22,
  "embeddings_stored": 98,
  "processing_time_seconds": 15.3,
  "message": "Successfully processed and embedded 98 chunks"
}
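
A client can sanity-check this response before moving on to search. The sketch below parses a response of the exact shape shown above; in a real client the dict would come from an HTTP call (e.g. requests.post(...).json()) rather than a hard-coded literal.

```python
def summarize_ingest_response(resp: dict) -> str:
    """Return a one-line summary of an /ingest response,
    raising if ingestion did not succeed. Field names follow
    the example response in this README."""
    if resp.get("status") != "success":
        raise RuntimeError(f"ingestion failed: {resp}")
    return (f"{resp['chunks_processed']} chunks from "
            f"{resp['files_processed']} files in "
            f"{resp['processing_time_seconds']}s")

# Example response copied from the documentation above.
example = {
    "status": "success",
    "chunks_processed": 98,
    "files_processed": 22,
    "embeddings_stored": 98,
    "processing_time_seconds": 15.3,
    "message": "Successfully processed and embedded 98 chunks",
}
print(summarize_ingest_response(example))  # 98 chunks from 22 files in 15.3s
```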

Semantic Search

POST /search

Find relevant document chunks using semantic similarity.

curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the remote work policy?",
    "top_k": 5,
    "threshold": 0.3
  }'

Response:

{
  "status": "success",
  "query": "What is the remote work policy?",
  "results_count": 3,
  "results": [
    {
      "chunk_id": "remote_work_policy_chunk_2",
      "content": "Employees may work remotely up to 3 days per week...",
      "similarity_score": 0.87,
      "metadata": {
        "filename": "remote_work_policy.md",
        "chunk_index": 2
      }
    }
  ]
}

Parameters:

  • query (required): Text query to search for
  • top_k (optional): Maximum number of results to return (default: 5, max: 20)
  • threshold (optional): Minimum similarity score threshold (default: 0.3)
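
top_k and threshold interact the way a typical similarity search does: score every candidate, drop anything below the threshold, and keep only the best top_k. A minimal sketch of that filtering logic follows; it is not the project's search_service.py, whose scoring implementation may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, candidates, top_k=5, threshold=0.3):
    """Rank (chunk_id, vector) candidates by cosine similarity,
    applying the same top_k/threshold semantics as the API parameters."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in candidates]
    kept = [(cid, s) for cid, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

candidates = [
    ("chunk_a", [1.0, 0.0]),  # same direction as the query -> score 1.0
    ("chunk_b", [0.0, 1.0]),  # orthogonal -> score 0.0, filtered by threshold
    ("chunk_c", [1.0, 1.0]),  # 45 degrees -> score ~0.71
]
results = search([1.0, 0.0], candidates, top_k=2, threshold=0.3)
print(results)
```

With threshold=0.3, chunk_b never reaches the ranking step, which is why results_count in the API response can be smaller than top_k.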

Corpus

The application uses a synthetic corpus of corporate policy documents located in the synthetic_policies/ directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).

Setup

  1. Clone the repository:

    git clone https://github.com/sethmcknight/msse-ai-engineering.git
    cd msse-ai-engineering
    
  2. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install the dependencies:

    pip install -r requirements.txt
    

Running the Application (local)

To run the Flask application locally:

export FLASK_APP=app.py
flask run

The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:

  • GET / - Basic application info
  • GET /health - Health check endpoint
  • POST /ingest - Document ingestion with embedding generation
  • POST /search - Semantic search for relevant documents

Quick Start Workflow

  1. Start the application:

    flask run
    
  2. Ingest and embed documents:

    curl -X POST http://localhost:5000/ingest \
      -H "Content-Type: application/json" \
      -d '{"store_embeddings": true}'
    
  3. Search for relevant content:

    curl -X POST http://localhost:5000/search \
      -H "Content-Type: application/json" \
      -d '{
        "query": "remote work policy",
        "top_k": 3,
        "threshold": 0.3
      }'
    

Architecture

The application follows a modular architecture with clear separation of concerns:

├── src/
│   ├── ingestion/          # Document processing and chunking
│   │   ├── document_parser.py      # File parsing (Markdown, text)
│   │   ├── document_chunker.py     # Text chunking with overlap
│   │   └── ingestion_pipeline.py   # Complete ingestion workflow
│   ├── embedding/          # Text embedding generation
│   │   └── embedding_service.py    # Sentence-transformer integration
│   ├── vector_store/       # Vector database operations
│   │   └── vector_db.py           # ChromaDB interface
│   ├── search/            # Semantic search functionality
│   │   └── search_service.py      # Search with similarity scoring
│   └── config.py          # Application configuration
├── tests/                 # Comprehensive test suite
├── synthetic_policies/    # Corporate policy corpus
└── app.py                # Flask application entry point
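
These modules compose into one pipeline: parse, chunk, embed, store, search. The sketch below mirrors that flow with in-memory stand-ins; the class names and method signatures here are assumptions for illustration, not the actual interfaces in src/.

```python
class StubEmbedder:
    """Stand-in for the embedding service: maps text to a toy 2-d vector
    (keyword counts) instead of a real sentence-transformer embedding."""
    def embed(self, text):
        return [text.count("remote"), text.count("expense")]

class InMemoryStore:
    """Stand-in for the ChromaDB-backed vector store."""
    def __init__(self):
        self.rows = []  # (chunk_id, vector, content)

    def add(self, chunk_id, vector, content):
        self.rows.append((chunk_id, vector, content))

    def query(self, vector, top_k=5):
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(self.rows, key=lambda r: dot(r[1], vector), reverse=True)
        return ranked[:top_k]

# Ingest: chunk -> embed -> store.
embedder = StubEmbedder()
store = InMemoryStore()
for cid, text in [("c1", "remote remote work policy"),
                  ("c2", "expense report limits")]:
    store.add(cid, embedder.embed(text), text)

# Search: embed the query, then rank stored chunks against it.
hits = store.query(embedder.embed("remote work"), top_k=1)
print(hits[0][0])  # the remote-work chunk ranks first
```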

Performance

Benchmark Results (Phase 2B):

  • Ingestion Rate: ~6-8 chunks/second for embedding generation
  • Search Response Time: < 1 second for semantic queries
  • Database Size: ~0.05MB per chunk (including metadata)
  • Memory Usage: Kept bounded by processing embeddings in 32-chunk batches
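
The 32-chunk batching above keeps memory bounded by embedding a fixed number of chunks at a time instead of the whole corpus at once. A generic batching helper looks like this (the batch size of 32 comes from the figure above; everything else is illustrative):

```python
def batches(items, batch_size=32):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

chunks = [f"chunk_{i}" for i in range(98)]  # 98 chunks, as in the ingest example
sizes = [len(b) for b in batches(chunks)]
print(sizes)  # 98 chunks in batches of 32: [32, 32, 32, 2]
```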

Running Tests

To run the complete test suite:

pytest

Test Coverage:

  • Unit Tests: Individual component testing (embedding, vector store, search, ingestion)
  • Integration Tests: Component interaction validation
  • End-to-End Tests: Complete pipeline testing (ingestion → embedding → search)
  • API Tests: Flask endpoint validation and error handling
  • Performance Tests: Benchmarking and quality validation

Test Statistics:

  • 60+ comprehensive tests covering all components
  • End-to-end pipeline validation with real data
  • Search quality metrics and performance benchmarks
  • Complete error handling and edge case coverage

Key Test Suites:

# Run specific test suites
pytest tests/test_embedding/              # Embedding service tests
pytest tests/test_vector_store/           # Vector database tests
pytest tests/test_search/                 # Search functionality tests
pytest tests/test_ingestion/              # Document processing tests
pytest tests/test_integration/            # End-to-end pipeline tests
pytest tests/test_app.py                  # Flask API tests

Local Development Infrastructure

For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the dev-tools/ directory:

Quick Commands (via Makefile)

make help        # Show all available commands
make format      # Auto-format code (black + isort)
make check       # Check formatting only
make test        # Run test suite
make ci-check    # Full CI/CD pipeline simulation
make clean       # Clean cache files

Full CI/CD Check Before Push

To prevent GitHub Actions failures, always run the full CI check locally:

make ci-check

This runs the complete pipeline: formatting checks, import sorting, linting, and all tests, exactly matching what GitHub Actions will do.

Development Workflow

# 1. Make your changes
# 2. Format and check
make format && make ci-check

# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch

For detailed information about the development tools, see dev-tools/README.md.

Development Progress

For detailed development progress, implementation decisions, and technical changes, see CHANGELOG.md. The changelog provides:

  • Chronological development history
  • Technical implementation details
  • Test results and coverage metrics
  • Component integration status
  • Performance benchmarks and optimization notes

This helps team members stay aligned on project progress and understand the evolution of the codebase.

CI/CD and Deployment

This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to main, the workflow triggers a Render deploy and runs a post-deploy smoke test against /health.

If you are deploying to Render manually:

  • Create a Web Service in Render (Environment: Docker).
  • Dockerfile Path: Dockerfile
  • Build Context: .
  • Health Check Path: /health
  • Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)

To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:

  • RENDER_API_KEY - Render API key
  • RENDER_SERVICE_ID - Render service id
  • RENDER_SERVICE_URL - Render public URL (used for smoke tests)

The workflow will create a small deploy-update-<ts> branch with an updated deployed.md after a successful deploy; that commit is marked with [skip-deploy] so merging it will not trigger another deploy.

Notes

  • run.sh binds Gunicorn to the PORT environment variable so it works on Render.
  • The Dockerfile copies only runtime files and uses .dockerignore to avoid including development artifacts.

Next steps

  • Add ingestion, embedding, and RAG components (with tests). See project-plan.md for detailed milestones.

Developer tooling

To keep the codebase formatted and linted automatically, we use pre-commit hooks.

  1. Create and activate your virtualenv (see Setup above).
  2. Install developer dependencies:

    pip install -r dev-requirements.txt

  3. Install the hooks (runs once per clone):

    pre-commit install

  4. To run all hooks locally (for example before pushing):

    pre-commit run --all-files
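
A typical .pre-commit-config.yaml for a black + isort + flake8 setup like this looks roughly as follows. This is a plausible sketch, not the repository's actual config; the hook versions are placeholders.

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2   # placeholder version
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2   # placeholder version
    hooks:
      - id: isort
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0    # placeholder version
    hooks:
      - id: flake8
```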

CI has a dedicated pre-commit-check job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.