# PostgreSQL Migration Guide
## Overview
This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.
## What's Been Implemented
### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info
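Similarity search ranks documents by cosine distance, the metric pgvector exposes through its `<=>` operator. As a quick illustration of what that metric computes (plain Python, no database required):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0; orthogonal vectors, distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

Lower distance means more similar, so the service's `top_k` results are the documents with the smallest distances.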
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats
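The format conversion can be sketched as follows. ChromaDB's `query()` returns nested lists (one inner list per query); the row tuple layout below is an assumption for illustration, since the real adapter maps whatever columns `PostgresVectorService` selects:

```python
def rows_to_chroma_result(rows):
    """Convert PostgreSQL result rows of (id, document, metadata, distance)
    into the nested-list shape ChromaDB's query() produces.

    The (id, document, metadata, distance) tuple order is hypothetical."""
    return {
        "ids": [[r[0] for r in rows]],
        "documents": [[r[1] for r in rows]],
        "metadatas": [[r[2] for r in rows]],
        "distances": [[r[3] for r in rows]],
    }

rows = [("doc-1", "hello world", {"source": "faq"}, 0.12)]
result = rows_to_chroma_result(rows)
print(result["ids"])  # → [['doc-1']]
```

Because callers only ever see the ChromaDB-shaped dictionary, existing search code keeps working regardless of which backend produced the rows.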
### 3. Updated Configuration (`src/config.py`)
- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
- `create_vector_database()` function selects backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration
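The selection logic can be sketched like this. The two classes here are stand-ins for the real backends, and the real factory in `src/vector_store/vector_db.py` also wires in connection settings from `src/config.py`:

```python
import os

class ChromaVectorDatabase:      # stand-in for the real ChromaDB backend
    backend = "chroma"

class PostgresVectorAdapter:     # stand-in for the pgvector-backed adapter
    backend = "postgres"

def create_vector_database():
    """Pick the vector backend from VECTOR_STORAGE_TYPE (default: chroma)."""
    storage_type = os.environ.get("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage_type == "postgres":
        return PostgresVectorAdapter()
    return ChromaVectorDatabase()

os.environ["VECTOR_STORAGE_TYPE"] = "postgres"
print(create_vector_database().backend)  # → postgres
```

Defaulting to ChromaDB means an unset or unrecognized `VECTOR_STORAGE_TYPE` falls back to the existing behavior rather than failing.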
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces the dataset from ~4GB to under 1GB so it fits the free tier's storage limit
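The batch-processing loop can be sketched as below. `insert_batch` stands in for the real PostgreSQL insert call, and the batch size of 100 is illustrative:

```python
import gc

def batched(items, batch_size):
    """Yield successive fixed-size chunks so only one batch is in memory."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def migrate(documents, insert_batch, batch_size=100):
    """Insert documents one batch at a time, aggressively freeing
    memory between batches. Returns the number of documents migrated."""
    migrated = 0
    for batch in batched(documents, batch_size):
        insert_batch(batch)
        migrated += len(batch)
        gc.collect()  # aggressive cleanup between operations
    return migrated

docs = [f"doc-{i}" for i in range(250)]
print(migrate(docs, insert_batch=lambda b: None))  # → 250
```

Streaming batches this way keeps peak memory proportional to the batch size rather than the full dataset, which matters when migrating on a memory-constrained host.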
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface
## Setup Instructions
### Step 1: Create Render PostgreSQL Database
1. Go to Render Dashboard
2. Create → PostgreSQL
3. Choose "Free" plan (1GB storage, 30 days)
4. Save the connection details
### Step 2: Enable pgvector Extension
You have several options to enable pgvector:
**Option A: Use the initialization script (Recommended)**
```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"
# Run the initialization script
python scripts/init_pgvector.py
```
**Option B: Manual SQL**
Connect to your database and run:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
**Option C: From Render Dashboard**
1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
The initialization script (`scripts/init_pgvector.py`) will:
- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages
### Step 3: Update Environment Variables
Add to your Render environment variables:
```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```
### Step 4: Install Dependencies
```bash
pip install psycopg2-binary==2.9.7
```
### Step 5: Run Migration (Optional)
If you have existing ChromaDB data:
```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```
## Usage
### Switch to PostgreSQL
Set environment variable:
```bash
export VECTOR_STORAGE_TYPE=postgres
```
### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database
# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```
## Expected Memory Reduction
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
| ---------------- | ----------------- | -------------------- | ------------- |
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **50-150MB** | **300-350MB** |
## Migration Optimizations
### Data Size Reduction
- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents
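The reductions above can be sketched in one helper. The thresholds and the whitelist of "essential" metadata fields here are assumptions for illustration, not the script's actual values:

```python
ESSENTIAL_FIELDS = {"source", "title", "url"}  # hypothetical field whitelist
MAX_CHARS = 1000   # truncation limit from the migration script
MIN_CHARS = 50     # assumed quality threshold

def shrink_document(text, metadata):
    """Skip very short documents, truncate long text, and keep only
    essential metadata fields. Returns None for filtered-out documents."""
    if len(text.strip()) < MIN_CHARS:
        return None  # quality filtering: drop very short documents
    cleaned_meta = {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS}
    return text[:MAX_CHARS], cleaned_meta

result = shrink_document("x" * 2500, {"source": "faq", "debug_blob": "..."})
print(len(result[0]), result[1])  # → 1000 {'source': 'faq'}
```

Applied across the corpus, truncation and metadata pruning are what bring the dataset under the free tier's 1GB limit.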
### Memory Management
- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory
## Testing
### Unit Tests
```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```
### Integration Tests (Requires Database)
```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```
### Migration Test
```bash
python scripts/migrate_to_postgres.py --test-only
```
## Deployment
### Local Development
Keep using ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
### Production (Render)
Switch to PostgreSQL:
```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```
## Troubleshooting
### Common Issues
1. **"pgvector extension not found"**
- Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
- Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
- Check firewall/network connectivity
3. **Memory still high**
- Verify `VECTOR_STORAGE_TYPE=postgres`
- Check that old ChromaDB files aren't being loaded
### Monitoring
```python
from src.vector_db.postgres_vector_service import PostgresVectorService
service = PostgresVectorService()
health = service.health_check()
print(health) # Shows connection status, document count, etc.
```
## Rollback Plan
If issues occur, simply change back to ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
The factory pattern ensures seamless switching between backends.
## Performance Comparison
| Operation | ChromaDB | PostgreSQL | Notes |
| ----------- | ---------- | ---------- | ---------------------- |
| Insert | Fast | Medium | Network overhead |
| Search | Very Fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Can upgrade storage |
## Next Steps
1. Test locally with PostgreSQL
2. Create Render PostgreSQL database
3. Run migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production