# MSSE AI Engineering Project

This project is a Retrieval-Augmented Generation (RAG) application that answers questions about a corpus of company policies using semantic search and AI-powered responses.

## Features

**Current Implementation (Phase 2B):**
- ✅ **Document Ingestion**: Process and chunk corporate policy documents with metadata tracking
- ✅ **Embedding Generation**: Convert text chunks to vector embeddings using sentence-transformers
- ✅ **Vector Storage**: Persistent storage using ChromaDB for similarity search
- ✅ **Semantic Search API**: REST endpoint for finding relevant document chunks
- ✅ **End-to-End Testing**: Comprehensive test suite validating the complete pipeline

**Upcoming (Phase 3):**
- 🚧 **RAG Implementation**: LLM integration for generating contextual responses
- 🚧 **Quality Evaluation**: Metrics and assessment tools for response quality

## API Documentation

### Document Ingestion

**POST /ingest**

Process and embed documents from the synthetic policies directory.

```bash
curl -X POST http://localhost:5000/ingest \
  -H "Content-Type: application/json" \
  -d '{"store_embeddings": true}'
```

**Response:**
```json
{
  "status": "success",
  "chunks_processed": 98,
  "files_processed": 22,
  "embeddings_stored": 98,
  "processing_time_seconds": 15.3,
  "message": "Successfully processed and embedded 98 chunks"
}
```

### Semantic Search

**POST /search**

Find relevant document chunks using semantic similarity.

```bash
curl -X POST http://localhost:5000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the remote work policy?",
    "top_k": 5,
    "threshold": 0.3
  }'
```

**Response:**
```json
{
  "status": "success",
  "query": "What is the remote work policy?",
  "results_count": 3,
  "results": [
    {
      "chunk_id": "remote_work_policy_chunk_2",
      "content": "Employees may work remotely up to 3 days per week...",
      "similarity_score": 0.87,
      "metadata": {
        "filename": "remote_work_policy.md",
        "chunk_index": 2
      }
    }
  ]
}
```

**Parameters:**
- `query` (required): Text query to search for
- `top_k` (optional): Maximum number of results to return (default: 5, max: 20)
- `threshold` (optional): Minimum similarity score threshold (default: 0.3)
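
If you prefer to call the search endpoint from Python rather than curl, a minimal client might look like the sketch below. It assumes the `requests` package is installed and the app is running locally on port 5000; the payload and response fields match the examples above.

```python
import requests

# Query the semantic search endpoint for the most relevant policy chunks.
payload = {
    "query": "What is the remote work policy?",
    "top_k": 5,        # return at most 5 chunks
    "threshold": 0.3,  # drop results below this similarity score
}

response = requests.post("http://localhost:5000/search", json=payload, timeout=30)
response.raise_for_status()

for result in response.json()["results"]:
    print(f'{result["similarity_score"]:.2f}  {result["chunk_id"]}')
    print(f'    {result["content"][:80]}...')
```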

## Corpus

The application uses a synthetic corpus of corporate policy documents located in the `synthetic_policies/` directory. This corpus contains 22 comprehensive policy documents covering HR, Finance, Security, Operations, and EHS topics, totaling approximately 10,600 words (about 42 pages of content).

## Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/sethmcknight/msse-ai-engineering.git
   cd msse-ai-engineering
   ```

2. Create and activate a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

## Running the Application (local)

To run the Flask application locally:

```bash
export FLASK_APP=app.py
flask run
```

The app will be available at http://127.0.0.1:5000/ and exposes the following endpoints:
- `GET /` - Basic application info
- `GET /health` - Health check endpoint
- `POST /ingest` - Document ingestion with embedding generation
- `POST /search` - Semantic search for relevant documents

### Quick Start Workflow

1. **Start the application:**
   ```bash
   flask run
   ```

2. **Ingest and embed documents:**
   ```bash
   curl -X POST http://localhost:5000/ingest \
     -H "Content-Type: application/json" \
     -d '{"store_embeddings": true}'
   ```

3. **Search for relevant content:**
   ```bash
   curl -X POST http://localhost:5000/search \
     -H "Content-Type: application/json" \
     -d '{
       "query": "remote work policy",
       "top_k": 3,
       "threshold": 0.3
     }'
   ```

## Architecture

The application follows a modular architecture with clear separation of concerns:

```
├── src/
│   ├── ingestion/          # Document processing and chunking
│   │   ├── document_parser.py      # File parsing (Markdown, text)
│   │   ├── document_chunker.py     # Text chunking with overlap
│   │   └── ingestion_pipeline.py   # Complete ingestion workflow
│   ├── embedding/          # Text embedding generation
│   │   └── embedding_service.py    # Sentence-transformer integration
│   ├── vector_store/       # Vector database operations
│   │   └── vector_db.py            # ChromaDB interface
│   ├── search/             # Semantic search functionality
│   │   └── search_service.py       # Search with similarity scoring
│   └── config.py           # Application configuration
├── tests/                  # Comprehensive test suite
├── synthetic_policies/     # Corporate policy corpus
└── app.py                  # Flask application entry point
```
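
The chunking logic itself lives in `src/ingestion/document_chunker.py`. As an illustration of the "chunking with overlap" idea (not the project's exact implementation; the function name, chunk size, and overlap values here are hypothetical), a chunker can slide a fixed-size window over the text so that consecutive chunks share some context:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```

Keeping an overlap between neighboring chunks reduces the chance that a sentence relevant to a query is split across a chunk boundary and lost to the search.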

## Performance

**Benchmark Results (Phase 2B):**
- **Ingestion Rate**: ~6-8 chunks/second for embedding generation
- **Search Response Time**: < 1 second for semantic queries
- **Database Size**: ~0.05MB per chunk (including metadata)
- **Memory Usage**: Kept modest by generating embeddings in fixed 32-chunk batches
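
The 32-chunk batching refers to how embeddings are generated and stored during ingestion. The sketch below shows that pattern in simplified form; the model name, collection name, and storage path are assumptions for illustration and are not guaranteed to match `embedding_service.py` or `vector_db.py`.

```python
import chromadb
from sentence_transformers import SentenceTransformer

BATCH_SIZE = 32  # matches the batch size cited in the benchmarks above

# Hypothetical model and collection names, for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("policies")

def embed_and_store(chunk_ids: list[str], chunks: list[str]) -> None:
    """Embed chunks in fixed-size batches and persist them to ChromaDB."""
    for start in range(0, len(chunks), BATCH_SIZE):
        batch_ids = chunk_ids[start:start + BATCH_SIZE]
        batch_texts = chunks[start:start + BATCH_SIZE]
        embeddings = model.encode(batch_texts, batch_size=BATCH_SIZE)
        collection.add(
            ids=batch_ids,
            documents=batch_texts,
            embeddings=embeddings.tolist(),
        )
```

Processing in small batches bounds peak memory: only one batch of texts and vectors is held in memory at a time, regardless of corpus size.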

## Running Tests

To run the complete test suite:

```bash
pytest
```

**Test Coverage:**
- **Unit Tests**: Individual component testing (embedding, vector store, search, ingestion)
- **Integration Tests**: Component interaction validation
- **End-to-End Tests**: Complete pipeline testing (ingestion → embedding → search)
- **API Tests**: Flask endpoint validation and error handling
- **Performance Tests**: Benchmarking and quality validation

**Test Statistics:**
- 60+ comprehensive tests covering all components
- End-to-end pipeline validation with real data
- Search quality metrics and performance benchmarks
- Complete error handling and edge case coverage

**Key Test Suites:**
```bash
# Run specific test suites
pytest tests/test_embedding/              # Embedding service tests
pytest tests/test_vector_store/           # Vector database tests
pytest tests/test_search/                 # Search functionality tests
pytest tests/test_ingestion/              # Document processing tests
pytest tests/test_integration/            # End-to-end pipeline tests
pytest tests/test_app.py                  # Flask API tests
```

## Local Development Infrastructure

For consistent code quality and to prevent CI/CD pipeline failures, we provide local development tools in the `dev-tools/` directory:

### Quick Commands (via Makefile)

```bash
make help        # Show all available commands
make format      # Auto-format code (black + isort)
make check       # Check formatting only
make test        # Run test suite
make ci-check    # Full CI/CD pipeline simulation
make clean       # Clean cache files
```

### Full CI/CD Check Before Push

To prevent GitHub Actions failures, always run the full CI check locally:

```bash
make ci-check
```

This runs the complete pipeline (formatting checks, import sorting, linting, and all tests), matching exactly what GitHub Actions will do.

### Development Workflow

```bash
# 1. Make your changes
# 2. Format and check
make format && make ci-check

# 3. If everything passes, commit and push
git add .
git commit -m "Your commit message"
git push origin your-branch
```

For detailed information about the development tools, see [`dev-tools/README.md`](./dev-tools/README.md).

## Development Progress

For detailed development progress, implementation decisions, and technical changes, see [`CHANGELOG.md`](./CHANGELOG.md). The changelog provides:

- Chronological development history
- Technical implementation details
- Test results and coverage metrics
- Component integration status
- Performance benchmarks and optimization notes

This helps team members stay aligned on project progress and understand the evolution of the codebase.

## CI/CD and Deployment

This repository includes a GitHub Actions workflow that runs tests on push and pull requests. After merging to `main`, the workflow triggers a Render deploy and runs a post-deploy smoke test against `/health`.

If you are deploying to Render manually:

- Create a Web Service in Render (Environment: Docker).
- Dockerfile Path: `Dockerfile`
- Build Context: `.`
- Health Check Path: `/health`
- Auto-Deploy: Off (recommended if you want GitHub Actions to trigger deploys)

To enable automated deploys from GitHub Actions, set these repository secrets in GitHub:

- `RENDER_API_KEY`: Render API key
- `RENDER_SERVICE_ID`: Render service ID
- `RENDER_SERVICE_URL`: Render public URL (used for smoke tests)
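
The post-deploy smoke test itself is defined in the workflow; a minimal equivalent you could run by hand looks roughly like the following (a sketch only, assuming `requests` is installed and `RENDER_SERVICE_URL` is exported in your shell):

```python
import os
import requests

# Hit the deployed service's health endpoint and fail loudly if it is unhealthy.
base_url = os.environ["RENDER_SERVICE_URL"].rstrip("/")
response = requests.get(f"{base_url}/health", timeout=30)
response.raise_for_status()
print(f"Smoke test passed: {base_url}/health returned {response.status_code}")
```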

The workflow will create a small `deploy-update-<ts>` branch with an updated `deployed.md` after a successful deploy; that commit is marked with `[skip-deploy]` so merging it will not trigger another deploy.

## Notes

- `run.sh` binds Gunicorn to the `PORT` environment variable so it works on Render.
- The Dockerfile copies only runtime files and uses `.dockerignore` to avoid including development artifacts.

## Next steps

- Implement the Phase 3 components: RAG response generation and quality evaluation (with tests). See `project-plan.md` for detailed milestones.

## Developer tooling

To keep the codebase formatted and linted automatically, we use pre-commit hooks.

1. Create and activate your virtualenv (see Setup above).
2. Install developer dependencies:

```bash
pip install -r dev-requirements.txt
```

3. Install the hooks (runs once per clone):

```bash
pre-commit install
```

4. To run all hooks locally (for example before pushing):

```bash
pre-commit run --all-files
```

CI has a dedicated `pre-commit-check` job that runs on pull requests and will fail the PR if any hook fails. We also run formatters and tests in the main build job.