# PostgreSQL Migration Guide

## Overview

This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. Storing vectors on disk instead of in RAM reduces memory usage from 400MB+ to roughly 50-100MB.

## What's Been Implemented

### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)

- Full PostgreSQL integration with the pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info

### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)

- Compatibility layer for the existing ChromaDB interface
- Enables migration without application code changes
- Converts between PostgreSQL and ChromaDB result formats

### 3. Updated Configuration (`src/config.py`)

- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters

### 4. Factory Pattern (`src/vector_store/vector_db.py`)

- `create_vector_database()` selects the backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration

### 5. Migration Script (`scripts/migrate_to_postgres.py`)

- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces data from ~4GB to ~1GB to fit the free tier

### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)

- Unit tests with mocked dependencies
- Integration tests against a real database
- Compatibility tests for the ChromaDB interface

## Setup Instructions

### Step 1: Create a Render PostgreSQL Database

1. Go to the Render Dashboard
2. Create → PostgreSQL
3. Choose the "Free" plan (1GB storage, 30-day limit)
4. Save the connection details

### Step 2: Enable the pgvector Extension

You have several options to enable pgvector:

**Option A: Use the initialization script (recommended)**

```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py
```

**Option B: Manual SQL**

Connect to your database and run:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Option C: From the Render Dashboard**

1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`

The initialization script (`scripts/init_pgvector.py`) will:

- Test the database connection
- Check PostgreSQL version compatibility (13+)
- Install the pgvector extension safely
- Verify that vector operations work correctly
- Provide detailed logging and error messages

### Step 3: Update Environment Variables

Add these to your Render environment variables:

```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```

### Step 4: Install Dependencies

```bash
pip install psycopg2-binary==2.9.7
```

### Step 5: Run the Migration (Optional)

If you have existing ChromaDB data:

```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```

## Usage

### Switch to PostgreSQL

Set the environment variable:

```bash
export VECTOR_STORAGE_TYPE=postgres
```

### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```

## Expected Memory Reduction

| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
|-----------|-------------------|--------------------|---------|
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **100-150MB** | **250-350MB** |

## Migration Optimizations

### Data Size Reduction

- **Text summarization**: documents truncated to 1000 characters
- **Metadata cleaning**: only essential fields kept
- **Dimension reduction**: smaller embedding models can be used
- **Quality filtering**: very short or low-quality documents are skipped

### Memory Management

- **Batch processing**: documents processed in small batches
- **Garbage collection**: aggressive cleanup between operations
- **Streaming**: data processed without loading everything into memory

## Testing

### Unit Tests

```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```

### Integration Tests (Requires a Database)

```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```

### Migration Test

```bash
python scripts/migrate_to_postgres.py --test-only
```

## Deployment

### Local Development

Keep using ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

### Production (Render)

Switch to PostgreSQL:

```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```

## Troubleshooting

### Common Issues

1. **"pgvector extension not found"**
   - Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
   - Verify the DATABASE_URL format: `postgresql://user:pass@host:port/db`
   - Check firewall/network connectivity
3. **Memory still high**
   - Verify `VECTOR_STORAGE_TYPE=postgres`
   - Check that old ChromaDB files aren't being loaded

### Monitoring

```python
from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.
```

## Rollback Plan

If issues occur, switch back to ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

The factory pattern ensures seamless switching between backends.

## Performance Comparison

| Operation | ChromaDB | PostgreSQL | Notes |
|-----------|----------|------------|-------|
| Insert | Fast | Medium | Network overhead |
| Search | Very fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Storage can be upgraded |

## Next Steps

1. Test locally with PostgreSQL
2. Create the Render PostgreSQL database
3. Run the migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production
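
## Appendix: Illustrative Sketches

For readers who want a feel for how the environment-driven backend switch works, here is a minimal sketch of a factory like `create_vector_database()`. The backend classes below are illustrative stand-ins, not the repository's actual implementations, which live under `src/vector_db/`:

```python
import os

# Hypothetical stand-ins for the real backend classes in this repo.
class ChromaBackend:
    name = "chroma"

class PostgresBackend:
    name = "postgres"

def create_vector_database():
    """Pick a vector backend from the VECTOR_STORAGE_TYPE env var.

    Defaults to ChromaDB so local development keeps working when the
    variable is unset.
    """
    storage_type = os.environ.get("VECTOR_STORAGE_TYPE", "chroma").lower()
    if storage_type == "postgres":
        return PostgresBackend()
    if storage_type == "chroma":
        return ChromaBackend()
    raise ValueError(f"Unknown VECTOR_STORAGE_TYPE: {storage_type!r}")

os.environ["VECTOR_STORAGE_TYPE"] = "postgres"
print(create_vector_database().name)  # prints "postgres"
```

Because the selection happens in one place, rollback is just a matter of flipping the environment variable, as described in the Rollback Plan above.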
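The similarity search mentioned above uses pgvector's cosine distance operator `<=>`, which computes 1 minus the cosine similarity of two vectors. A pure-Python equivalent makes the semantics concrete (identical vectors give distance 0, orthogonal vectors give 1):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, matching pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical vectors → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 1.0
```

In SQL, the corresponding query shape is `ORDER BY embedding <=> %s LIMIT k`, so the smallest distances (most similar documents) come back first.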
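Finally, the migration script's data-size optimizations (truncation, metadata whitelisting, quality filtering) can be pictured with a small self-contained sketch. The field names and thresholds here are hypothetical, not the script's actual schema:

```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_id"}  # hypothetical whitelist
MAX_CHARS = 1000   # truncation limit from the migration notes
MIN_CHARS = 20     # hypothetical quality threshold

def optimize_document(text, metadata):
    """Truncate long text and drop non-essential metadata.

    Returns None for documents too short to be worth migrating,
    otherwise a (text, metadata) pair ready for insertion.
    """
    text = text.strip()
    if len(text) < MIN_CHARS:
        return None  # quality filtering: skip very short documents
    return (
        text[:MAX_CHARS],  # text summarization by truncation
        {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS},
    )
```

Applying a transform like this per batch, with garbage collection between batches, is what keeps the migration within the free tier's memory and storage limits.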