# PostgreSQL Migration Guide
## Overview
This branch adds PostgreSQL with the pgvector extension as an alternative to ChromaDB for vector storage. Storing vectors on disk instead of in RAM reduces memory usage from 400MB+ to ~50-100MB.
## What's Been Implemented
### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info
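For orientation, here is a minimal sketch of what the cosine-distance search looks like with psycopg2 and pgvector's `<=>` operator. The table and column names are illustrative, not necessarily the schema `PostgresVectorService` creates:
```python
import psycopg2

def search_similar(conn, query_embedding, top_k=5):
    """Nearest-neighbour search by cosine distance (pgvector's <=> operator)."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, document, metadata, embedding <=> %s::vector AS distance
            FROM documents
            ORDER BY distance
            LIMIT %s
            """,
            (vec_literal, top_k),
        )
        return cur.fetchall()

conn = psycopg2.connect("postgresql://user:password@host:5432/database")
results = search_similar(conn, [0.1] * 384, top_k=5)
```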
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats
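The format conversion is the core of the adapter. A hedged sketch of mapping PostgreSQL rows into the nested-list shape that ChromaDB's `query()` returns (the real adapter's internals may differ):
```python
def to_chroma_result(rows):
    """Convert (id, document, metadata, distance) rows into a ChromaDB-style
    query result, which nests each field in one inner list per query embedding."""
    return {
        "ids": [[row[0] for row in rows]],
        "documents": [[row[1] for row in rows]],
        "metadatas": [[row[2] for row in rows]],
        "distances": [[row[3] for row in rows]],
    }
```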
### 3. Updated Configuration (`src/config.py`)
- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters
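In practice the configuration boils down to a few environment-driven values. A sketch; only `VECTOR_STORAGE_TYPE`, `DATABASE_URL`, and `MEMORY_LIMIT_MB` come from this guide, the defaults are assumptions:
```python
import os

# "chroma" (default) or "postgres"
VECTOR_STORAGE_TYPE = os.environ.get("VECTOR_STORAGE_TYPE", "chroma")

# Standard PostgreSQL connection string: postgresql://user:pass@host:port/db
DATABASE_URL = os.environ.get("DATABASE_URL", "")

# Soft memory budget (MB) used by the migration and batching code
MEMORY_LIMIT_MB = int(os.environ.get("MEMORY_LIMIT_MB", "400"))
```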
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
- `create_vector_database()` function selects backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration
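The selection logic is roughly the following; module paths and the ChromaDB class name are assumptions, and the actual factory may take arguments or do extra validation:
```python
import os

def create_vector_database():
    """Return the vector backend selected by VECTOR_STORAGE_TYPE."""
    if os.environ.get("VECTOR_STORAGE_TYPE", "chroma") == "postgres":
        from src.vector_db.postgres_adapter import PostgresVectorAdapter
        return PostgresVectorAdapter()          # wraps PostgresVectorService
    from src.vector_store.chroma_db import ChromaVectorDatabase  # assumed name/path
    return ChromaVectorDatabase()
```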
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Reduces the dataset from ~4GB to ~1GB so it fits the free tier's storage limit
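On the read side, batching over the existing ChromaDB collection can use `collection.get(limit=..., offset=...)`. A rough sketch; the local path, collection name, and batch size are assumptions:
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")   # assumed local path
collection = client.get_collection("documents")          # assumed collection name

batch_size = 100
offset = 0
while True:
    batch = collection.get(
        limit=batch_size,
        offset=offset,
        include=["embeddings", "documents", "metadatas"],
    )
    if not batch["ids"]:
        break
    # ...optimize each document and write the batch to PostgreSQL here...
    offset += batch_size
```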
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface
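For the unit tests, mocking means patching the database driver so no server is needed. A hedged example, assuming the service module imports `psycopg2` directly; the test body is illustrative:
```python
from unittest.mock import MagicMock, patch

from src.vector_db.postgres_vector_service import PostgresVectorService

@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
def test_health_check_with_mocked_connection(mock_connect):
    """The service should report a healthy status when the (mocked) connection works."""
    mock_connect.return_value = MagicMock()   # stand-in for a live connection
    service = PostgresVectorService()
    assert service.health_check() is not None
```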
## Setup Instructions
### Step 1: Create Render PostgreSQL Database
1. Go to Render Dashboard
2. Create → PostgreSQL
3. Choose the "Free" plan (1GB storage, expires after 30 days)
4. Save the connection details
### Step 2: Enable pgvector Extension
You have several options to enable pgvector:
**Option A: Use the initialization script (Recommended)**
```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"
# Run the initialization script
python scripts/init_pgvector.py
```
**Option B: Manual SQL**
Connect to your database and run:
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
**Option C: From Render Dashboard**
1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
The initialization script (`scripts/init_pgvector.py`) will:
- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages
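The verification step amounts to installing the extension and running a trivial distance query. A standalone sketch of the same checks (not the script itself):
```python
import psycopg2

conn = psycopg2.connect("postgresql://user:password@host:5432/database")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("SELECT '[1,2,3]'::vector <=> '[1,2,3]'::vector;")
    distance = cur.fetchone()[0]
    assert distance == 0.0   # identical vectors have zero cosine distance
conn.commit()
conn.close()
```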
### Step 3: Update Environment Variables
Add to your Render environment variables:
```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```
### Step 4: Install Dependencies
```bash
pip install psycopg2-binary==2.9.7
```
### Step 5: Run Migration (Optional)
If you have existing ChromaDB data:
```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```
## Usage
### Switch to PostgreSQL
Set the environment variable:
```bash
export VECTOR_STORAGE_TYPE=postgres
```
### Use in Code (No Changes Required!)
```python
from src.vector_store.vector_db import create_vector_database
# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```
## Expected Memory Reduction
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
| ---------------- | ----------------- | -------------------- | ------------- |
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
| Application Code | 50-100MB | 50-100MB | 0MB |
| **Total** | **350-500MB** | **50-150MB** | **250-350MB** |
## Migration Optimizations
### Data Size Reduction
- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents
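A sketch of the per-document optimization described above; the metadata whitelist and the minimum-length threshold are assumptions:
```python
ESSENTIAL_FIELDS = {"source", "title", "chunk_index"}   # assumed whitelist

def optimize_document(text, metadata, min_length=50):
    """Truncate text to 1000 characters, keep only essential metadata fields,
    and drop documents that are too short to be useful."""
    if len(text.strip()) < min_length:
        return None   # filtered out as low quality
    return {
        "text": text[:1000],
        "metadata": {k: v for k, v in metadata.items() if k in ESSENTIAL_FIELDS},
    }
```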
### Memory Management
- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory
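On the write side, the batching and cleanup look roughly like this, using the same `add_embeddings()` call shown in the Usage section below (record keys are assumptions):
```python
import gc

def insert_in_batches(vector_db, records, batch_size=100):
    """Stream records into the vector store in small batches, freeing memory as we go."""
    batch = []
    for record in records:            # `records` can be a generator (streaming)
        batch.append(record)
        if len(batch) >= batch_size:
            _flush(vector_db, batch)
            batch.clear()
            gc.collect()              # aggressive cleanup between batches
    if batch:                         # write any remainder
        _flush(vector_db, batch)

def _flush(vector_db, batch):
    vector_db.add_embeddings(
        [r["embedding"] for r in batch],
        [r["id"] for r in batch],
        [r["text"] for r in batch],
        [r["metadata"] for r in batch],
    )
```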
## Testing
### Unit Tests
```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```
### Integration Tests (Requires Database)
```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```
### Migration Test
```bash
python scripts/migrate_to_postgres.py --test-only
```
## Deployment
### Local Development
Keep using ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
### Production (Render)
Switch to PostgreSQL:
```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```
## Troubleshooting
### Common Issues
1. **"pgvector extension not found"**
- Run `CREATE EXTENSION vector;` in your database
2. **Connection errors**
- Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
- Check firewall/network connectivity
3. **Memory still high**
- Verify `VECTOR_STORAGE_TYPE=postgres`
- Check that old ChromaDB files aren't being loaded
### Monitoring
```python
from src.vector_db.postgres_vector_service import PostgresVectorService
service = PostgresVectorService()
health = service.health_check()
print(health) # Shows connection status, document count, etc.
```
## Rollback Plan
If issues occur, simply change back to ChromaDB:
```bash
export VECTOR_STORAGE_TYPE=chroma
```
The factory pattern ensures seamless switching between backends.
## Performance Comparison
| Operation | ChromaDB | PostgreSQL | Notes |
| ----------- | ---------- | ---------- | ---------------------- |
| Insert | Fast | Medium | Network overhead |
| Search | Very Fast | Fast | pgvector is optimized |
| Memory | High | Low | Vectors stored on disk |
| Persistence | File-based | Database | More reliable |
| Scaling | Limited | Excellent | Can upgrade storage |
## Next Steps
1. Test locally with PostgreSQL
2. Create Render PostgreSQL database
3. Run migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production