# PostgreSQL Migration Guide
## Overview
This branch adds PostgreSQL with the pgvector extension as an alternative to ChromaDB for vector storage. Storing vectors on disk instead of in RAM reduces memory usage from 400MB+ to ~50-100MB.
## What's Been Implemented
### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info
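For orientation, here is a minimal sketch of what the cosine-distance search looks like with psycopg2 and pgvector's `<=>` operator. The table and column names are illustrative, not necessarily the schema `PostgresVectorService` creates:
```python
import psycopg2

def search_similar(conn, query_embedding, top_k=5):
    """Nearest-neighbour search by cosine distance (pgvector's <=> operator)."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, document, metadata, embedding <=> %s::vector AS distance
            FROM documents
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, top_k),
        )
        return cur.fetchall()

conn = psycopg2.connect("postgresql://user:password@host:5432/database")
results = search_similar(conn, [0.1] * 384, top_k=5)
```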
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats
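The format conversion is the core of the adapter. A hedged sketch of mapping PostgreSQL rows into the nested-list shape that ChromaDB's `query()` returns (the real adapter's internals may differ):
```python
def to_chroma_result(rows):
    """Convert (id, document, metadata, distance) rows into a ChromaDB-style
    query result, which nests each field in one inner list per query embedding."""
    return {
        "ids": [[row[0] for row in rows]],
        "documents": [[row[1] for row in rows]],
        "metadatas": [[row[2] for row in rows]],
        "distances": [[row[3] for row in rows]],
    }
```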
### 3. Updated Configuration (`src/config.py`)
- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters
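In practice the configuration boils down to a few environment-driven values. A sketch; only `VECTOR_STORAGE_TYPE`, `DATABASE_URL`, and `MEMORY_LIMIT_MB` come from this guide, the defaults are assumptions:
```python
import os

# "chroma" (default) or "postgres"
VECTOR_STORAGE_TYPE = os.environ.get("VECTOR_STORAGE_TYPE", "chroma")

# Standard PostgreSQL connection string: postgresql://user:pass@host:port/db
DATABASE_URL = os.environ.get("DATABASE_URL", "")

# Soft memory budget (MB) used by the migration and batching code
MEMORY_LIMIT_MB = int(os.environ.get("MEMORY_LIMIT_MB", "400"))
```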
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
- `create_vector_database()` function selects backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration
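The selection logic is roughly the following; module paths and the ChromaDB class name are assumptions, and the actual factory may take arguments or do extra validation:
```python
import os

def create_vector_database():
    """Return the vector backend selected by VECTOR_STORAGE_TYPE."""
    if os.environ.get("VECTOR_STORAGE_TYPE", "chroma") == "postgres":
        from src.vector_db.postgres_adapter import PostgresVectorAdapter
        return PostgresVectorAdapter()          # wraps PostgresVectorService
    from src.vector_store.chroma_db import ChromaVectorDatabase  # assumed name/path
    return ChromaVectorDatabase()
```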
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces the dataset from ~4GB to ~1GB so it fits the free tier's storage limit
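On the read side, batching over the existing ChromaDB collection can use `collection.get(limit=..., offset=...)`. A rough sketch; the local path, collection name, and batch size are assumptions:
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")   # assumed local path
collection = client.get_collection("documents")          # assumed collection name

batch_size = 100
offset = 0
while True:
    batch = collection.get(
        limit=batch_size,
        offset=offset,
        include=["embeddings", "documents", "metadatas"],
    )
    if not batch["ids"]:
        break
    # ...optimize each document and write the batch to PostgreSQL here...
    offset += batch_size
```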
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface
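For the unit tests, mocking means patching the database driver so no server is needed. A hedged example, assuming the service module imports `psycopg2` directly; the test body is illustrative:
```python
from unittest.mock import MagicMock, patch

from src.vector_db.postgres_vector_service import PostgresVectorService

@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
def test_health_check_with_mocked_connection(mock_connect):
    """The service should report a healthy status when the (mocked) connection works."""
    mock_connect.return_value = MagicMock()   # stand-in for a live connection
    service = PostgresVectorService()
    assert service.health_check() is not None
```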
## Setup Instructions
### Step 1: Create Render PostgreSQL Database
1. Go to Render Dashboard
2. Create → PostgreSQL
3. Choose the "Free" plan (1GB storage, expires after 30 days)
4. Save the connection details
### Step 2: Enable pgvector Extension
You have several options to enable pgvector:
**Option A: Use the initialization script (Recommended)**
```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"
# Run the initialization script
python scripts/init_pgvector.py
```
**Option B: Manual SQL**
Connect to your database and run:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
**Option C: From Render Dashboard**
1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
The initialization script (`scripts/init_pgvector.py`) will:
- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages
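The verification step amounts to installing the extension and running a trivial distance query. A standalone sketch of the same checks (not the script itself):
```python
import psycopg2

conn = psycopg2.connect("postgresql://user:password@host:5432/database")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT '[1,2,3]'::vector <=> '[1,2,3]'::vector;")
    distance = cur.fetchone()[0]
    assert distance == 0.0   # identical vectors have zero cosine distance
conn.commit()
conn.close()
```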
### Step 3: Update Environment Variables
Add to your Render environment variables:
```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```
### Step 4: Install Dependencies
```bash
pip install psycopg2-binary==2.9.7
```
### Step 5: Run Migration (Optional)
If you have existing ChromaDB data:
```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```
## Usage
### Switch to PostgreSQL
Set the environment variable:
```bash
export VECTOR_STORAGE_TYPE=postgres
```
### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database
# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```
## Expected Memory Reduction
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
| ---------------- | ----------------- | -------------------- | ------------- |
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **50-150MB** | **250-350MB** |
## Migration Optimizations
### Data Size Reduction
- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents
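A sketch of the per-document optimization described above; the metadata whitelist and the minimum-length threshold are assumptions:
```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_index"}   # assumed whitelist

def optimize_document(text, metadata, min_length=50):
    """Truncate text to 1000 characters, keep only essential metadata fields,
    and drop documents that are too short to be useful."""
    if len(text.strip()) < min_length:
        return None   # filtered out as low quality
    return {
        "text": text[:1000],
        "metadata": {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS},
    }
```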
### Memory Management
- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory
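On the write side, the batching and cleanup look roughly like this, using the same `add_embeddings()` call shown in the Usage section below (record keys are assumptions):
```python
import gc

def insert_in_batches(vector_db, records, batch_size=100):
    """Stream records into the vector store in small batches, freeing memory as we go."""
    batch = []
    for record in records:            # `records` can be a generator (streaming)
        batch.append(record)
        if len(batch) >= batch_size:
            _flush(vector_db, batch)
            batch.clear()
            gc.collect()              # aggressive cleanup between batches
    if batch:                         # write any remainder
        _flush(vector_db, batch)

def _flush(vector_db, batch):
    vector_db.add_embeddings(
        [r["embedding"] for r in batch],
        [r["id"] for r in batch],
        [r["text"] for r in batch],
        [r["metadata"] for r in batch],
    )
```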
## Testing
### Unit Tests
```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```
### Integration Tests (Requires Database)
```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```
### Migration Test
```bash
python scripts/migrate_to_postgres.py --test-only
```
## Deployment
### Local Development
Keep using ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
### Production (Render)
Switch to PostgreSQL:
```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```
## Troubleshooting
### Common Issues
1. **"pgvector extension not found"**
- Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
- Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
- Check firewall/network connectivity
3. **Memory still high**
- Verify `VECTOR_STORAGE_TYPE=postgres`
- Check that old ChromaDB files aren't being loaded
### Monitoring
```python
from src.vector_db.postgres_vector_service import PostgresVectorService
service = PostgresVectorService()
health = service.health_check()
print(health) # Shows connection status, document count, etc.
```
## Rollback Plan
If issues occur, simply change back to ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
The factory pattern ensures seamless switching between backends.
## Performance Comparison
| Operation | ChromaDB | PostgreSQL | Notes |
| ----------- | ---------- | ---------- | ---------------------- |
| Insert | Fast | Medium | Network overhead |
| Search | Very Fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Can upgrade storage |
## Next Steps
1. Test locally with PostgreSQL
2. Create Render PostgreSQL database
3. Run migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production