# PostgreSQL Migration Guide
## Overview
This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.
## What's Been Implemented
### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info
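Similarity search ranks documents by cosine distance, the metric pgvector exposes through its `<=>` operator. As a quick illustration of what that metric computes (plain Python, no database required):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0; orthogonal vectors, distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

Lower distance means more similar, so the service's `top_k` results are the documents with the smallest distances.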
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats
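The format conversion can be sketched as follows. ChromaDB's `query()` returns nested lists (one inner list per query); the row tuple layout below is an assumption for illustration, since the real adapter maps whatever columns `PostgresVectorService` selects:

```python
def rows_to_chroma_result(rows):
    """Convert PostgreSQL result rows of (id, document, metadata, distance)
    into the nested-list shape ChromaDB's query() produces.

    The (id, document, metadata, distance) tuple order is hypothetical."""
    return {
        "ids": [[r[0] for r in rows]],
        "documents": [[r[1] for r in rows]],
        "metadatas": [[r[2] for r in rows]],
        "distances": [[r[3] for r in rows]],
    }

rows = [("doc-1", "hello world", {"source": "faq"}, 0.12)]
result = rows_to_chroma_result(rows)
print(result["ids"])  # → [['doc-1']]
```

Because callers only ever see the ChromaDB-shaped dictionary, existing search code keeps working regardless of which backend produced the rows.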
### 3. Updated Configuration (`src/config.py`)
- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
- `create_vector_database()` function selects backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration
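The selection logic can be sketched like this. The two classes here are stand-ins for the real backends, and the real factory in `src/vector_store/vector_db.py` also wires in connection settings from `src/config.py`:

```python
import os

class ChromaVectorDatabase:      # stand-in for the real ChromaDB backend
    backend = "chroma"

class PostgresVectorAdapter:     # stand-in for the pgvector-backed adapter
    backend = "postgres"

def create_vector_database():
    """Pick the vector backend from VECTOR_STORAGE_TYPE (default: chroma)."""
    storage_type = os.environ.get("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage_type == "postgres":
        return PostgresVectorAdapter()
    return ChromaVectorDatabase()

os.environ["VECTOR_STORAGE_TYPE"] = "postgres"
print(create_vector_database().backend)  # → postgres
```

Defaulting to ChromaDB means an unset or unrecognized `VECTOR_STORAGE_TYPE` falls back to the existing behavior rather than failing.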
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces the dataset from ~4GB to under 1GB so it fits the free tier's storage limit
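The batch-processing loop can be sketched as below. `insert_batch` stands in for the real PostgreSQL insert call, and the batch size of 100 is illustrative:

```python
import gc

def batched(items, batch_size):
    """Yield successive fixed-size chunks so only one batch is in memory."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def migrate(documents, insert_batch, batch_size=100):
    """Insert documents one batch at a time, aggressively freeing
    memory between batches. Returns the number of documents migrated."""
    migrated = 0
    for batch in batched(documents, batch_size):
        insert_batch(batch)
        migrated += len(batch)
        gc.collect()  # aggressive cleanup between operations
    return migrated

docs = [f"doc-{i}" for i in range(250)]
print(migrate(docs, insert_batch=lambda b: None))  # → 250
```

Streaming batches this way keeps peak memory proportional to the batch size rather than the full dataset, which matters when migrating on a memory-constrained host.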
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface
## Setup Instructions
### Step 1: Create Render PostgreSQL Database
1. Go to Render Dashboard
2. Create → PostgreSQL
3. Choose "Free" plan (1GB storage, 30 days)
4. Save the connection details
### Step 2: Enable pgvector Extension
You have several options to enable pgvector:
**Option A: Use the initialization script (Recommended)**
```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"
# Run the initialization script
python scripts/init_pgvector.py
```
**Option B: Manual SQL**
Connect to your database and run:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
**Option C: From Render Dashboard**
1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
The initialization script (`scripts/init_pgvector.py`) will:
- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages
### Step 3: Update Environment Variables
Add to your Render environment variables:
```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```
### Step 4: Install Dependencies
```bash
pip install psycopg2-binary==2.9.7
```
### Step 5: Run Migration (Optional)
If you have existing ChromaDB data:
```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```
## Usage
### Switch to PostgreSQL
Set environment variable:
```bash
export VECTOR_STORAGE_TYPE=postgres
```
### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database
# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```
## Expected Memory Reduction
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
| ---------------- | ----------------- | -------------------- | ------------- |
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **50-150MB** | **300-350MB** |
## Migration Optimizations
### Data Size Reduction
- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents
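The reductions above can be sketched in one helper. The thresholds and the whitelist of "essential" metadata fields here are assumptions for illustration, not the script's actual values:

```python
ESSENTIAL_FIELDS = {"source", "title", "url"}  # hypothetical field whitelist
MAX_CHARS = 1000   # truncation limit from the migration script
MIN_CHARS = 50     # assumed quality threshold

def shrink_document(text, metadata):
    """Skip very short documents, truncate long text, and keep only
    essential metadata fields. Returns None for filtered-out documents."""
    if len(text.strip()) < MIN_CHARS:
        return None  # quality filtering: drop very short documents
    cleaned_meta = {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS}
    return text[:MAX_CHARS], cleaned_meta

result = shrink_document("x" * 2500, {"source": "faq", "debug_blob": "..."})
print(len(result[0]), result[1])  # → 1000 {'source': 'faq'}
```

Applied across the corpus, truncation and metadata pruning are what bring the dataset under the free tier's 1GB limit.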
### Memory Management
- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory
## Testing
### Unit Tests
```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```
### Integration Tests (Requires Database)
```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```
### Migration Test
```bash
python scripts/migrate_to_postgres.py --test-only
```
## Deployment
### Local Development
Keep using ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
### Production (Render)
Switch to PostgreSQL:
```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```
## Troubleshooting
### Common Issues
1. **"pgvector extension not found"**
- Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
- Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
- Check firewall/network connectivity
3. **Memory still high**
- Verify `VECTOR_STORAGE_TYPE=postgres`
- Check that old ChromaDB files aren't being loaded
### Monitoring
```python
from src.vector_db.postgres_vector_service import PostgresVectorService
service = PostgresVectorService()
health = service.health_check()
print(health) # Shows connection status, document count, etc.
```
## Rollback Plan
If issues occur, simply change back to ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
The factory pattern ensures seamless switching between backends.
## Performance Comparison
| Operation | ChromaDB | PostgreSQL | Notes |
| ----------- | ---------- | ---------- | ---------------------- |
| Insert | Fast | Medium | Network overhead |
| Search | Very Fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Can upgrade storage |
## Next Steps
1. Test locally with PostgreSQL
2. Create Render PostgreSQL database
3. Run migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production