# PostgreSQL Migration Guide

## Overview

This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. Storing vectors on disk instead of in RAM reduces memory usage from 400MB+ to roughly 50-100MB.

## What's Been Implemented

### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)

- Full PostgreSQL integration with the pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info

### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)

- Compatibility layer for the existing ChromaDB interface
- Enables migration without application code changes
- Converts between PostgreSQL and ChromaDB result formats

### 3. Updated Configuration (`src/config.py`)

- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters

### 4. Factory Pattern (`src/vector_store/vector_db.py`)

- `create_vector_database()` selects the backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration

### 5. Migration Script (`scripts/migrate_to_postgres.py`)

- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces data from ~4GB to ~1GB to fit the free tier

### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)

- Unit tests with mocked dependencies
- Integration tests against a real database
- Compatibility tests for the ChromaDB interface

## Setup Instructions

### Step 1: Create a Render PostgreSQL Database

1. Go to the Render Dashboard
2. Create → PostgreSQL
3. Choose the "Free" plan (1GB storage, 30-day limit)
4. Save the connection details

### Step 2: Enable the pgvector Extension

You have several options to enable pgvector:

**Option A: Use the initialization script (recommended)**

```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py
```

**Option B: Manual SQL**

Connect to your database and run:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Option C: From the Render Dashboard**

1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`

The initialization script (`scripts/init_pgvector.py`) will:

- Test the database connection
- Check PostgreSQL version compatibility (13+)
- Install the pgvector extension safely
- Verify that vector operations work correctly
- Provide detailed logging and error messages

### Step 3: Update Environment Variables

Add these to your Render environment variables:

```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```

### Step 4: Install Dependencies

```bash
pip install psycopg2-binary==2.9.7
```

### Step 5: Run the Migration (Optional)

If you have existing ChromaDB data:

```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```

## Usage

### Switch to PostgreSQL

Set the environment variable:

```bash
export VECTOR_STORAGE_TYPE=postgres
```

### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```

## Expected Memory Reduction

| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
|-----------|-------------------|--------------------|---------|
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **100-150MB** | **250-350MB** |

## Migration Optimizations

### Data Size Reduction

- **Text summarization**: documents truncated to 1000 characters
- **Metadata cleaning**: only essential fields kept
- **Dimension reduction**: smaller embedding models can be used
- **Quality filtering**: very short or low-quality documents are skipped

### Memory Management

- **Batch processing**: documents processed in small batches
- **Garbage collection**: aggressive cleanup between operations
- **Streaming**: data processed without loading everything into memory

## Testing

### Unit Tests

```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```

### Integration Tests (Requires a Database)

```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```

### Migration Test

```bash
python scripts/migrate_to_postgres.py --test-only
```

## Deployment

### Local Development

Keep using ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

### Production (Render)

Switch to PostgreSQL:

```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```

## Troubleshooting

### Common Issues

1. **"pgvector extension not found"**
   - Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
   - Verify the DATABASE_URL format: `postgresql://user:pass@host:port/db`
   - Check firewall/network connectivity
3. **Memory still high**
   - Verify `VECTOR_STORAGE_TYPE=postgres`
   - Check that old ChromaDB files aren't being loaded

### Monitoring

```python
from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.
```

## Rollback Plan

If issues occur, switch back to ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

The factory pattern ensures seamless switching between backends.

## Performance Comparison

| Operation | ChromaDB | PostgreSQL | Notes |
|-----------|----------|------------|-------|
| Insert | Fast | Medium | Network overhead |
| Search | Very fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Storage can be upgraded |

## Next Steps

1. Test locally with PostgreSQL
2. Create the Render PostgreSQL database
3. Run the migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production
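
## Appendix: Illustrative Sketches

For readers who want a feel for how the environment-driven backend switch works, here is a minimal sketch of a factory like `create_vector_database()`. The backend classes below are illustrative stand-ins, not the repository's actual implementations, which live under `src/vector_db/`:

```python
import os

# Hypothetical stand-ins for the real backend classes in this repo.
class ChromaBackend:
    name = "chroma"

class PostgresBackend:
    name = "postgres"

def create_vector_database():
    """Pick a vector backend from the VECTOR_STORAGE_TYPE env var.

    Defaults to ChromaDB so local development keeps working when the
    variable is unset.
    """
    storage_type = os.environ.get("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage_type == "postgres":
        return PostgresBackend()
    if storage_type == "chroma":
        return ChromaBackend()
    raise ValueError(f"Unknown VECTOR_STORAGE_TYPE: {storage_type!r}")

os.environ["VECTOR_STORAGE_TYPE"] = "postgres"
print(create_vector_database().name)  # prints "postgres"
```

Because the selection happens in one place, rollback is just a matter of flipping the environment variable, as described in the Rollback Plan above.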
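The similarity search mentioned above uses pgvector's cosine distance operator `<=>`, which computes 1 minus the cosine similarity of two vectors. A pure-Python equivalent makes the semantics concrete (identical vectors give distance 0, orthogonal vectors give 1):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, matching pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical vectors → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 1.0
```

In SQL, the corresponding query shape is `ORDER BY embedding <=> %s LIMIT k`, so the smallest distances (most similar documents) come back first.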
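Finally, the migration script's data-size optimizations (truncation, metadata whitelisting, quality filtering) can be pictured with a small self-contained sketch. The field names and thresholds here are hypothetical, not the script's actual schema:

```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_id"}  # hypothetical whitelist
MAX_CHARS = 1000   # truncation limit from the migration notes
MIN_CHARS = 20     # hypothetical quality threshold

def optimize_document(text, metadata):
    """Truncate long text and drop non-essential metadata.

    Returns None for documents too short to be worth migrating,
    otherwise a (text, metadata) pair ready for insertion.
    """
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None  # quality filtering: skip very short documents
    return (
        text[:MAX_CHARS],  # text summarization by truncation
        {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS},
    )
```

Applying a transform like this per batch, with garbage collection between batches, is what keeps the migration within the free tier's memory and storage limits.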