Spaces:
Sleeping
Sleeping
Seth McKnight
commited on
Commit
·
dca679b
1
Parent(s):
48155ff
Postgres vector migration (#83)
Browse files* feat: Implement PostgreSQL with pgvector as ChromaDB alternative
- Add PostgresVectorService with full pgvector integration
- Create PostgresVectorAdapter for ChromaDB compatibility
- Update config to support vector storage type selection
- Add factory pattern for seamless backend switching
- Include migration script with data optimization
- Add comprehensive tests for PostgreSQL implementation
- Update dependencies and environment configuration
- Expected memory reduction: 300-350MB (from 400MB+ to 50-150MB)
This enables deployment on Render's 512MB free tier by using persistent
PostgreSQL storage instead of in-memory ChromaDB.
* Add pgvector init script, update migration docs, and test adjustments
- POSTGRES_MIGRATION.md +214 -0
- dev-requirements.txt +1 -0
- requirements.txt +1 -0
- scripts/init_pgvector.py +208 -0
- scripts/migrate_to_postgres.py +434 -0
- src/config.py +25 -6
- src/vector_db/postgres_adapter.py +126 -0
- src/vector_db/postgres_vector_service.py +473 -0
- src/vector_store/vector_db.py +31 -1
- tests/test_embedding/test_embedding_service.py +5 -4
- tests/test_vector_store/test_postgres_vector.py +366 -0
- tests/test_vector_store/test_postgres_vector_simple.py +72 -0
POSTGRES_MIGRATION.md
ADDED
|
@@ -0,0 +1,214 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# PostgreSQL Migration Guide
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.
|
| 5 |
+
|
| 6 |
+
## What's Been Implemented
|
| 7 |
+
|
| 8 |
+
### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
|
| 9 |
+
- Full PostgreSQL integration with pgvector extension
|
| 10 |
+
- Automatic table creation and indexing
|
| 11 |
+
- Similarity search using cosine distance
|
| 12 |
+
- Document CRUD operations
|
| 13 |
+
- Health monitoring and collection info
|
| 14 |
+
|
| 15 |
+
### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
|
| 16 |
+
- Compatibility layer for existing ChromaDB interface
|
| 17 |
+
- Ensures seamless migration without code changes
|
| 18 |
+
- Converts between PostgreSQL and ChromaDB result formats
|
| 19 |
+
|
| 20 |
+
### 3. Updated Configuration (`src/config.py`)
|
| 21 |
+
- Added `VECTOR_STORAGE_TYPE` environment variable
|
| 22 |
+
- PostgreSQL connection settings
|
| 23 |
+
- Memory optimization parameters
|
| 24 |
+
|
| 25 |
+
### 4. Factory Pattern (`src/vector_store/vector_db.py`)
|
| 26 |
+
- `create_vector_database()` function selects backend automatically
|
| 27 |
+
- Supports both ChromaDB and PostgreSQL based on configuration
|
| 28 |
+
|
| 29 |
+
### 5. Migration Script (`scripts/migrate_to_postgres.py`)
|
| 30 |
+
- Data optimization (text summarization, metadata cleaning)
|
| 31 |
+
- Batch processing with memory management
|
| 32 |
+
- Handles 4GB → 1GB data reduction for free tier
|
| 33 |
+
|
| 34 |
+
### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
|
| 35 |
+
- Unit tests with mocked dependencies
|
| 36 |
+
- Integration tests for real database
|
| 37 |
+
- Compatibility tests for ChromaDB interface
|
| 38 |
+
|
| 39 |
+
## Setup Instructions
|
| 40 |
+
|
| 41 |
+
### Step 1: Create Render PostgreSQL Database
|
| 42 |
+
1. Go to Render Dashboard
|
| 43 |
+
2. Create → PostgreSQL
|
| 44 |
+
3. Choose "Free" plan (1GB storage, 30 days)
|
| 45 |
+
4. Save the connection details
|
| 46 |
+
|
| 47 |
+
### Step 2: Enable pgvector Extension
|
| 48 |
+
You have several options to enable pgvector:
|
| 49 |
+
|
| 50 |
+
**Option A: Use the initialization script (Recommended)**
|
| 51 |
+
```bash
|
| 52 |
+
# Set your database URL
|
| 53 |
+
export DATABASE_URL="postgresql://user:password@host:port/database"
|
| 54 |
+
|
| 55 |
+
# Run the initialization script
|
| 56 |
+
python scripts/init_pgvector.py
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
**Option B: Manual SQL**
|
| 60 |
+
Connect to your database and run:
|
| 61 |
+
```sql
|
| 62 |
+
CREATE EXTENSION IF NOT EXISTS vector;
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
**Option C: From Render Dashboard**
|
| 66 |
+
1. Go to your PostgreSQL service → Info tab
|
| 67 |
+
2. Use the "PSQL Command" to connect
|
| 68 |
+
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
|
| 69 |
+
|
| 70 |
+
The initialization script (`scripts/init_pgvector.py`) will:
|
| 71 |
+
- Test database connection
|
| 72 |
+
- Check PostgreSQL version compatibility (13+)
|
| 73 |
+
- Install pgvector extension safely
|
| 74 |
+
- Verify vector operations work correctly
|
| 75 |
+
- Provide detailed logging and error messages
|
| 76 |
+
|
| 77 |
+
### Step 3: Update Environment Variables
|
| 78 |
+
Add to your Render environment variables:
|
| 79 |
+
```bash
|
| 80 |
+
DATABASE_URL=postgresql://username:password@host:port/database
|
| 81 |
+
VECTOR_STORAGE_TYPE=postgres
|
| 82 |
+
MEMORY_LIMIT_MB=400
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
### Step 4: Install Dependencies
|
| 86 |
+
```bash
|
| 87 |
+
pip install psycopg2-binary==2.9.7
|
| 88 |
+
```
|
| 89 |
+
|
| 90 |
+
### Step 5: Run Migration (Optional)
|
| 91 |
+
If you have existing ChromaDB data:
|
| 92 |
+
```bash
|
| 93 |
+
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
|
| 94 |
+
```
|
| 95 |
+
|
| 96 |
+
## Usage
|
| 97 |
+
|
| 98 |
+
### Switch to PostgreSQL
|
| 99 |
+
Set environment variable:
|
| 100 |
+
```bash
|
| 101 |
+
export VECTOR_STORAGE_TYPE=postgres
|
| 102 |
+
```
|
| 103 |
+
|
| 104 |
+
### Use in Code (No Changes Required!)
|
| 105 |
+
```python
|
| 106 |
+
from src.vector_store.vector_db import create_vector_database
|
| 107 |
+
|
| 108 |
+
# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
|
| 109 |
+
vector_db = create_vector_database()
|
| 110 |
+
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
|
| 111 |
+
results = vector_db.search(query_embedding, top_k=5)
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
## Expected Memory Reduction
|
| 115 |
+
|
| 116 |
+
| Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
|
| 117 |
+
|-----------|------------------|-------------------|---------|
|
| 118 |
+
| Vector Storage | 200-300MB | 0MB (disk) | 200-300MB |
|
| 119 |
+
| Embedding Model | 100MB | 50MB (smaller model) | 50MB |
|
| 120 |
+
| Application Code | 50-100MB | 50-100MB | 0MB |
|
| 121 |
+
| **Total** | **350-500MB** | **50-150MB** | **300-350MB** |
|
| 122 |
+
|
| 123 |
+
## Migration Optimizations
|
| 124 |
+
|
| 125 |
+
### Data Size Reduction
|
| 126 |
+
- **Text Summarization**: Documents truncated to 1000 characters
|
| 127 |
+
- **Metadata Cleaning**: Only essential fields kept
|
| 128 |
+
- **Dimension Reduction**: Can use smaller embedding models
|
| 129 |
+
- **Quality Filtering**: Skip very short or low-quality documents
|
| 130 |
+
|
| 131 |
+
### Memory Management
|
| 132 |
+
- **Batch Processing**: Process documents in small batches
|
| 133 |
+
- **Garbage Collection**: Aggressive cleanup between operations
|
| 134 |
+
- **Streaming**: Process data without loading everything into memory
|
| 135 |
+
|
| 136 |
+
## Testing
|
| 137 |
+
|
| 138 |
+
### Unit Tests
|
| 139 |
+
```bash
|
| 140 |
+
pytest tests/test_vector_store/test_postgres_vector.py -v
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
### Integration Tests (Requires Database)
|
| 144 |
+
```bash
|
| 145 |
+
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
|
| 146 |
+
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
|
| 147 |
+
```
|
| 148 |
+
|
| 149 |
+
### Migration Test
|
| 150 |
+
```bash
|
| 151 |
+
python scripts/migrate_to_postgres.py --test-only
|
| 152 |
+
```
|
| 153 |
+
|
| 154 |
+
## Deployment
|
| 155 |
+
|
| 156 |
+
### Local Development
|
| 157 |
+
Keep using ChromaDB:
|
| 158 |
+
```bash
|
| 159 |
+
export VECTOR_STORAGE_TYPE=chroma
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
### Production (Render)
|
| 163 |
+
Switch to PostgreSQL:
|
| 164 |
+
```bash
|
| 165 |
+
export VECTOR_STORAGE_TYPE=postgres
|
| 166 |
+
export DATABASE_URL="your-render-postgres-url"
|
| 167 |
+
```
|
| 168 |
+
|
| 169 |
+
## Troubleshooting
|
| 170 |
+
|
| 171 |
+
### Common Issues
|
| 172 |
+
1. **"pgvector extension not found"**
|
| 173 |
+
- Run `CREATE EXTENSION vector;` in your database
|
| 174 |
+
|
| 175 |
+
2. **Connection errors**
|
| 176 |
+
- Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
|
| 177 |
+
- Check firewall/network connectivity
|
| 178 |
+
|
| 179 |
+
3. **Memory still high**
|
| 180 |
+
- Verify `VECTOR_STORAGE_TYPE=postgres`
|
| 181 |
+
- Check that old ChromaDB files aren't being loaded
|
| 182 |
+
|
| 183 |
+
### Monitoring
|
| 184 |
+
```python
|
| 185 |
+
from src.vector_db.postgres_vector_service import PostgresVectorService
|
| 186 |
+
|
| 187 |
+
service = PostgresVectorService()
|
| 188 |
+
health = service.health_check()
|
| 189 |
+
print(health) # Shows connection status, document count, etc.
|
| 190 |
+
```
|
| 191 |
+
|
| 192 |
+
## Rollback Plan
|
| 193 |
+
If issues occur, simply change back to ChromaDB:
|
| 194 |
+
```bash
|
| 195 |
+
export VECTOR_STORAGE_TYPE=chroma
|
| 196 |
+
```
|
| 197 |
+
The factory pattern ensures seamless switching between backends.
|
| 198 |
+
|
| 199 |
+
## Performance Comparison
|
| 200 |
+
|
| 201 |
+
| Operation | ChromaDB | PostgreSQL | Notes |
|
| 202 |
+
|-----------|----------|------------|-------|
|
| 203 |
+
| Insert | Fast | Medium | Network overhead |
|
| 204 |
+
| Search | Very Fast | Fast | pgvector is optimized |
|
| 205 |
+
| Memory | High | Low | Vectors stored on disk |
|
| 206 |
+
| Persistence | File-based | Database | More reliable |
|
| 207 |
+
| Scaling | Limited | Excellent | Can upgrade storage |
|
| 208 |
+
|
| 209 |
+
## Next Steps
|
| 210 |
+
1. Test locally with PostgreSQL
|
| 211 |
+
2. Create Render PostgreSQL database
|
| 212 |
+
3. Run migration script
|
| 213 |
+
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
|
| 214 |
+
5. Monitor memory usage in production
|
dev-requirements.txt
CHANGED
|
@@ -3,3 +3,4 @@ black>=25.0.0
|
|
| 3 |
isort==5.13.0
|
| 4 |
flake8==6.1.0
|
| 5 |
psutil
|
|
|
|
|
|
| 3 |
isort==5.13.0
|
| 4 |
flake8==6.1.0
|
| 5 |
psutil
|
| 6 |
+
psycopg2-binary==2.9.7
|
requirements.txt
CHANGED
|
@@ -5,6 +5,7 @@ gunicorn==22.0.0
|
|
| 5 |
# Vector database and embeddings
|
| 6 |
chromadb==0.4.24
|
| 7 |
sentence-transformers==2.7.0
|
|
|
|
| 8 |
|
| 9 |
# Core dependencies (pinned for reproducibility, Python 3.12 compatible)
|
| 10 |
numpy==1.26.4
|
|
|
|
| 5 |
# Vector database and embeddings
|
| 6 |
chromadb==0.4.24
|
| 7 |
sentence-transformers==2.7.0
|
| 8 |
+
psycopg2-binary==2.9.7
|
| 9 |
|
| 10 |
# Core dependencies (pinned for reproducibility, Python 3.12 compatible)
|
| 11 |
numpy==1.26.4
|
scripts/init_pgvector.py
ADDED
|
@@ -0,0 +1,208 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Initialize pgvector extension in PostgreSQL database.
|
| 4 |
+
|
| 5 |
+
This script connects to the database specified by DATABASE_URL environment variable
|
| 6 |
+
and enables the pgvector extension if not already installed.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python scripts/init_pgvector.py
|
| 10 |
+
|
| 11 |
+
Environment Variables:
|
| 12 |
+
DATABASE_URL: PostgreSQL connection string (required)
|
| 13 |
+
|
| 14 |
+
Exit Codes:
|
| 15 |
+
0: Success - pgvector extension is installed and working
|
| 16 |
+
1: Error - connection failed, extension installation failed, or other error
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
import logging
|
| 20 |
+
import os
|
| 21 |
+
import sys
|
| 22 |
+
|
| 23 |
+
import psycopg2 # type: ignore
|
| 24 |
+
import psycopg2.extras # type: ignore
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def setup_logging() -> logging.Logger:
|
| 28 |
+
"""Setup logging configuration."""
|
| 29 |
+
logging.basicConfig(
|
| 30 |
+
level=logging.INFO,
|
| 31 |
+
format="%(asctime)s - %(levelname)s - %(message)s",
|
| 32 |
+
datefmt="%Y-%m-%d %H:%M:%S",
|
| 33 |
+
)
|
| 34 |
+
return logging.getLogger(__name__)
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def get_database_url() -> str:
|
| 38 |
+
"""Get DATABASE_URL from environment."""
|
| 39 |
+
database_url = os.getenv("DATABASE_URL")
|
| 40 |
+
if not database_url:
|
| 41 |
+
raise ValueError("DATABASE_URL environment variable is required")
|
| 42 |
+
return database_url
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def test_connection(connection_string: str, logger: logging.Logger) -> bool:
|
| 46 |
+
"""Test database connection."""
|
| 47 |
+
try:
|
| 48 |
+
with psycopg2.connect(connection_string) as conn:
|
| 49 |
+
with conn.cursor() as cur:
|
| 50 |
+
cur.execute("SELECT 1;")
|
| 51 |
+
result = cur.fetchone()
|
| 52 |
+
if result and result[0] == 1:
|
| 53 |
+
logger.info("✅ Database connection successful")
|
| 54 |
+
return True
|
| 55 |
+
else:
|
| 56 |
+
logger.error("❌ Unexpected result from connection test")
|
| 57 |
+
return False
|
| 58 |
+
except Exception as e:
|
| 59 |
+
logger.error(f"❌ Database connection failed: {e}")
|
| 60 |
+
return False
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def check_postgresql_version(connection_string: str, logger: logging.Logger) -> bool:
|
| 64 |
+
"""Check if PostgreSQL version supports pgvector (13+)."""
|
| 65 |
+
try:
|
| 66 |
+
with psycopg2.connect(connection_string) as conn:
|
| 67 |
+
with conn.cursor() as cur:
|
| 68 |
+
cur.execute("SELECT version();")
|
| 69 |
+
result = cur.fetchone()
|
| 70 |
+
if not result:
|
| 71 |
+
logger.error("❌ Could not get PostgreSQL version")
|
| 72 |
+
return False
|
| 73 |
+
|
| 74 |
+
version_string = str(result[0])
|
| 75 |
+
|
| 76 |
+
# Extract major version number
|
| 77 |
+
# Format: "PostgreSQL 15.4 on x86_64-pc-linux-gnu..."
|
| 78 |
+
version_parts = version_string.split()
|
| 79 |
+
if len(version_parts) >= 2:
|
| 80 |
+
version_number = version_parts[1].split(".")[0]
|
| 81 |
+
major_version = int(version_number)
|
| 82 |
+
|
| 83 |
+
if major_version >= 13:
|
| 84 |
+
logger.info(
|
| 85 |
+
f"✅ PostgreSQL version {major_version} supports pgvector"
|
| 86 |
+
)
|
| 87 |
+
return True
|
| 88 |
+
else:
|
| 89 |
+
logger.error(
|
| 90 |
+
"❌ PostgreSQL version %s is too old (requires 13+)",
|
| 91 |
+
major_version,
|
| 92 |
+
)
|
| 93 |
+
return False
|
| 94 |
+
else:
|
| 95 |
+
logger.warning(
|
| 96 |
+
f"⚠️ Could not parse PostgreSQL version: {version_string}"
|
| 97 |
+
)
|
| 98 |
+
return True # Proceed anyway
|
| 99 |
+
|
| 100 |
+
except Exception as e:
|
| 101 |
+
logger.error(f"❌ Failed to check PostgreSQL version: {e}")
|
| 102 |
+
return False
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def install_pgvector_extension(connection_string: str, logger: logging.Logger) -> bool:
|
| 106 |
+
"""Install pgvector extension."""
|
| 107 |
+
try:
|
| 108 |
+
with psycopg2.connect(connection_string) as conn:
|
| 109 |
+
conn.autocommit = True # Required for CREATE EXTENSION
|
| 110 |
+
with conn.cursor() as cur:
|
| 111 |
+
logger.info("Installing pgvector extension...")
|
| 112 |
+
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
|
| 113 |
+
logger.info("✅ pgvector extension installed successfully")
|
| 114 |
+
return True
|
| 115 |
+
|
| 116 |
+
except psycopg2.errors.InsufficientPrivilege as e:
|
| 117 |
+
logger.error("❌ Insufficient privileges to install extension: %s", str(e))
|
| 118 |
+
logger.error(
|
| 119 |
+
"Make sure your database user has CREATE privilege or is a superuser"
|
| 120 |
+
)
|
| 121 |
+
return False
|
| 122 |
+
except Exception as e:
|
| 123 |
+
logger.error(f"❌ Failed to install pgvector extension: {e}")
|
| 124 |
+
return False
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def verify_pgvector_installation(
|
| 128 |
+
connection_string: str, logger: logging.Logger
|
| 129 |
+
) -> bool:
|
| 130 |
+
"""Verify pgvector extension is properly installed."""
|
| 131 |
+
try:
|
| 132 |
+
with psycopg2.connect(connection_string) as conn:
|
| 133 |
+
with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cur:
|
| 134 |
+
# Check extension is installed
|
| 135 |
+
cur.execute(
|
| 136 |
+
"SELECT extname, extversion FROM pg_extension "
|
| 137 |
+
"WHERE extname = 'vector';"
|
| 138 |
+
)
|
| 139 |
+
result = cur.fetchone()
|
| 140 |
+
|
| 141 |
+
if not result:
|
| 142 |
+
logger.error("❌ pgvector extension not found in pg_extension")
|
| 143 |
+
return False
|
| 144 |
+
|
| 145 |
+
logger.info(f"✅ pgvector extension version: {result['extversion']}")
|
| 146 |
+
|
| 147 |
+
# Test basic vector functionality
|
| 148 |
+
cur.execute("SELECT '[1,2,3]'::vector(3);")
|
| 149 |
+
vector_result = cur.fetchone()
|
| 150 |
+
if vector_result:
|
| 151 |
+
logger.info("✅ Vector type functioning correctly")
|
| 152 |
+
else:
|
| 153 |
+
logger.error("❌ Vector type test failed")
|
| 154 |
+
return False
|
| 155 |
+
|
| 156 |
+
# Test vector operations
|
| 157 |
+
cur.execute("SELECT '[1,2,3]'::vector(3) <-> '[1,2,4]'::vector(3);")
|
| 158 |
+
distance_result = cur.fetchone()
|
| 159 |
+
if distance_result and distance_result[0] == 1.0:
|
| 160 |
+
logger.info("✅ Vector distance operations working")
|
| 161 |
+
return True
|
| 162 |
+
else:
|
| 163 |
+
logger.error("❌ Vector distance operations failed")
|
| 164 |
+
return False
|
| 165 |
+
|
| 166 |
+
except Exception as e:
|
| 167 |
+
logger.error(f"❌ Failed to verify pgvector installation: {e}")
|
| 168 |
+
return False
|
| 169 |
+
|
| 170 |
+
|
| 171 |
+
def main() -> int:
|
| 172 |
+
"""Main function."""
|
| 173 |
+
logger = setup_logging()
|
| 174 |
+
|
| 175 |
+
try:
|
| 176 |
+
logger.info("🚀 Starting pgvector initialization...")
|
| 177 |
+
|
| 178 |
+
# Get database connection string
|
| 179 |
+
database_url = get_database_url()
|
| 180 |
+
logger.info("📡 Got DATABASE_URL from environment")
|
| 181 |
+
|
| 182 |
+
# Test connection
|
| 183 |
+
if not test_connection(database_url, logger):
|
| 184 |
+
return 1
|
| 185 |
+
|
| 186 |
+
# Check PostgreSQL version
|
| 187 |
+
if not check_postgresql_version(database_url, logger):
|
| 188 |
+
return 1
|
| 189 |
+
|
| 190 |
+
# Install pgvector extension
|
| 191 |
+
if not install_pgvector_extension(database_url, logger):
|
| 192 |
+
return 1
|
| 193 |
+
|
| 194 |
+
# Verify installation
|
| 195 |
+
if not verify_pgvector_installation(database_url, logger):
|
| 196 |
+
return 1
|
| 197 |
+
|
| 198 |
+
logger.info("🎉 pgvector initialization completed successfully!")
|
| 199 |
+
logger.info(" Your PostgreSQL database is now ready for vector operations.")
|
| 200 |
+
return 0
|
| 201 |
+
|
| 202 |
+
except Exception as e:
|
| 203 |
+
logger.error(f"❌ Unexpected error: {e}")
|
| 204 |
+
return 1
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
if __name__ == "__main__":
|
| 208 |
+
sys.exit(main())
|
scripts/migrate_to_postgres.py
ADDED
|
@@ -0,0 +1,434 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Migration script to move data from ChromaDB to PostgreSQL with data optimization.
|
| 4 |
+
This script reduces data size to fit within Render's 1GB PostgreSQL free tier limit.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import gc
|
| 8 |
+
import logging
|
| 9 |
+
import os
|
| 10 |
+
import re
|
| 11 |
+
import sys
|
| 12 |
+
from typing import Any, Dict, List, Optional
|
| 13 |
+
|
| 14 |
+
# Add the src directory to the path
|
| 15 |
+
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
|
| 16 |
+
|
| 17 |
+
from src.config import (
|
| 18 |
+
COLLECTION_NAME,
|
| 19 |
+
MAX_DOCUMENT_LENGTH,
|
| 20 |
+
MAX_DOCUMENTS_IN_MEMORY,
|
| 21 |
+
VECTOR_DB_PERSIST_PATH,
|
| 22 |
+
)
|
| 23 |
+
from src.embedding.embedding_service import EmbeddingService
|
| 24 |
+
from src.vector_db.postgres_vector_service import PostgresVectorService
|
| 25 |
+
from src.vector_store.vector_db import VectorDatabase
|
| 26 |
+
|
| 27 |
+
# Configure logging
|
| 28 |
+
logging.basicConfig(
|
| 29 |
+
level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
| 30 |
+
)
|
| 31 |
+
logger = logging.getLogger(__name__)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
class DataOptimizer:
|
| 35 |
+
"""Optimizes document data to reduce storage requirements."""
|
| 36 |
+
|
| 37 |
+
@staticmethod
|
| 38 |
+
def summarize_text(text: str, max_length: int = MAX_DOCUMENT_LENGTH) -> str:
|
| 39 |
+
"""
|
| 40 |
+
Summarize text to reduce storage while preserving key information.
|
| 41 |
+
|
| 42 |
+
Args:
|
| 43 |
+
text: Original text
|
| 44 |
+
max_length: Maximum length for summarized text
|
| 45 |
+
|
| 46 |
+
Returns:
|
| 47 |
+
Summarized text
|
| 48 |
+
"""
|
| 49 |
+
if len(text) <= max_length:
|
| 50 |
+
return text.strip()
|
| 51 |
+
|
| 52 |
+
# Simple extractive summarization: keep first few sentences
|
| 53 |
+
sentences = re.split(r"[.!?]+", text)
|
| 54 |
+
summary = ""
|
| 55 |
+
|
| 56 |
+
for sentence in sentences:
|
| 57 |
+
sentence = sentence.strip()
|
| 58 |
+
if not sentence:
|
| 59 |
+
continue
|
| 60 |
+
|
| 61 |
+
# Check if adding this sentence would exceed limit
|
| 62 |
+
if len(summary + sentence + ".") > max_length:
|
| 63 |
+
break
|
| 64 |
+
|
| 65 |
+
summary += sentence + ". "
|
| 66 |
+
|
| 67 |
+
# If summary is too short, take first max_length characters
|
| 68 |
+
if len(summary) < max_length // 4:
|
| 69 |
+
summary = text[:max_length].strip()
|
| 70 |
+
|
| 71 |
+
return summary.strip()
|
| 72 |
+
|
| 73 |
+
@staticmethod
|
| 74 |
+
def clean_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
|
| 75 |
+
"""
|
| 76 |
+
Clean metadata to keep only essential fields.
|
| 77 |
+
|
| 78 |
+
Args:
|
| 79 |
+
metadata: Original metadata
|
| 80 |
+
|
| 81 |
+
Returns:
|
| 82 |
+
Cleaned metadata with only essential fields
|
| 83 |
+
"""
|
| 84 |
+
essential_fields = {
|
| 85 |
+
"source",
|
| 86 |
+
"title",
|
| 87 |
+
"page",
|
| 88 |
+
"chunk_id",
|
| 89 |
+
"document_type",
|
| 90 |
+
"created_at",
|
| 91 |
+
"file_path",
|
| 92 |
+
"section",
|
| 93 |
+
}
|
| 94 |
+
|
| 95 |
+
cleaned = {}
|
| 96 |
+
for key, value in metadata.items():
|
| 97 |
+
if key in essential_fields and value is not None:
|
| 98 |
+
# Convert to simple types and truncate long strings
|
| 99 |
+
if isinstance(value, str) and len(value) > 100:
|
| 100 |
+
cleaned[key] = value[:100]
|
| 101 |
+
elif isinstance(value, (str, int, float, bool)):
|
| 102 |
+
cleaned[key] = value
|
| 103 |
+
|
| 104 |
+
return cleaned
|
| 105 |
+
|
| 106 |
+
@staticmethod
|
| 107 |
+
def should_include_document(metadata: Dict[str, Any], content: str) -> bool:
|
| 108 |
+
"""
|
| 109 |
+
Decide whether to include a document based on quality metrics.
|
| 110 |
+
|
| 111 |
+
Args:
|
| 112 |
+
metadata: Document metadata
|
| 113 |
+
content: Document content
|
| 114 |
+
|
| 115 |
+
Returns:
|
| 116 |
+
True if document should be included
|
| 117 |
+
"""
|
| 118 |
+
# Skip very short documents (likely not useful)
|
| 119 |
+
if len(content.strip()) < 50:
|
| 120 |
+
return False
|
| 121 |
+
|
| 122 |
+
# Skip documents with no meaningful content
|
| 123 |
+
if not re.search(r"[a-zA-Z]{3,}", content):
|
| 124 |
+
return False
|
| 125 |
+
|
| 126 |
+
# Prioritize certain document types if available
|
| 127 |
+
doc_type = metadata.get("document_type", "").lower()
|
| 128 |
+
if doc_type in ["policy", "procedure", "guideline"]:
|
| 129 |
+
return True
|
| 130 |
+
|
| 131 |
+
return True
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
class ChromaToPostgresMigrator:
|
| 135 |
+
"""Migrates data from ChromaDB to PostgreSQL with optimization."""
|
| 136 |
+
|
| 137 |
+
def __init__(self, database_url: Optional[str] = None):
|
| 138 |
+
"""
|
| 139 |
+
Initialize the migrator.
|
| 140 |
+
|
| 141 |
+
Args:
|
| 142 |
+
database_url: PostgreSQL connection string
|
| 143 |
+
"""
|
| 144 |
+
self.database_url = database_url or os.getenv("DATABASE_URL")
|
| 145 |
+
if not self.database_url:
|
| 146 |
+
raise ValueError("DATABASE_URL environment variable is required")
|
| 147 |
+
|
| 148 |
+
self.optimizer = DataOptimizer()
|
| 149 |
+
self.embedding_service = None
|
| 150 |
+
self.total_migrated = 0
|
| 151 |
+
self.total_skipped = 0
|
| 152 |
+
|
| 153 |
+
def initialize_services(self):
|
| 154 |
+
"""Initialize embedding service and database connections."""
|
| 155 |
+
logger.info("Initializing services...")
|
| 156 |
+
|
| 157 |
+
# Initialize embedding service
|
| 158 |
+
self.embedding_service = EmbeddingService()
|
| 159 |
+
|
| 160 |
+
# Initialize ChromaDB (source)
|
| 161 |
+
self.chroma_db = VectorDatabase(
|
| 162 |
+
persist_path=VECTOR_DB_PERSIST_PATH, collection_name=COLLECTION_NAME
|
| 163 |
+
)
|
| 164 |
+
|
| 165 |
+
# Initialize PostgreSQL (destination)
|
| 166 |
+
self.postgres_service = PostgresVectorService(
|
| 167 |
+
connection_string=self.database_url, table_name=COLLECTION_NAME
|
| 168 |
+
)
|
| 169 |
+
|
| 170 |
+
logger.info("Services initialized successfully")
|
| 171 |
+
|
| 172 |
+
def get_chroma_documents(
|
| 173 |
+
self, batch_size: int = MAX_DOCUMENTS_IN_MEMORY
|
| 174 |
+
) -> List[Dict[str, Any]]:
|
| 175 |
+
"""
|
| 176 |
+
Retrieve all documents from ChromaDB in batches.
|
| 177 |
+
|
| 178 |
+
Args:
|
| 179 |
+
batch_size: Number of documents to retrieve per batch
|
| 180 |
+
|
| 181 |
+
Yields:
|
| 182 |
+
Batches of documents
|
| 183 |
+
"""
|
| 184 |
+
try:
|
| 185 |
+
total_count = self.chroma_db.get_count()
|
| 186 |
+
logger.info(f"Found {total_count} documents in ChromaDB")
|
| 187 |
+
|
| 188 |
+
if total_count == 0:
|
| 189 |
+
return
|
| 190 |
+
|
| 191 |
+
# Get all documents (ChromaDB doesn't have native pagination)
|
| 192 |
+
collection = self.chroma_db.get_collection()
|
| 193 |
+
all_data = collection.get(include=["documents", "metadatas", "embeddings"])
|
| 194 |
+
|
| 195 |
+
if not all_data or not all_data.get("documents"):
|
| 196 |
+
logger.warning("No documents found in ChromaDB collection")
|
| 197 |
+
return
|
| 198 |
+
|
| 199 |
+
# Process in batches
|
| 200 |
+
documents = all_data["documents"]
|
| 201 |
+
metadatas = all_data.get("metadatas", [{}] * len(documents))
|
| 202 |
+
embeddings = all_data.get("embeddings", [])
|
| 203 |
+
ids = all_data.get("ids", [])
|
| 204 |
+
|
| 205 |
+
for i in range(0, len(documents), batch_size):
|
| 206 |
+
batch_end = min(i + batch_size, len(documents))
|
| 207 |
+
|
| 208 |
+
batch_docs = documents[i:batch_end]
|
| 209 |
+
batch_metadata = (
|
| 210 |
+
metadatas[i:batch_end] if metadatas else [{}] * len(batch_docs)
|
| 211 |
+
)
|
| 212 |
+
batch_embeddings = embeddings[i:batch_end] if embeddings else []
|
| 213 |
+
batch_ids = ids[i:batch_end] if ids else []
|
| 214 |
+
|
| 215 |
+
yield {
|
| 216 |
+
"documents": batch_docs,
|
| 217 |
+
"metadatas": batch_metadata,
|
| 218 |
+
"embeddings": batch_embeddings,
|
| 219 |
+
"ids": batch_ids,
|
| 220 |
+
}
|
| 221 |
+
|
| 222 |
+
except Exception as e:
|
| 223 |
+
logger.error(f"Error retrieving ChromaDB documents: {e}")
|
| 224 |
+
raise
|
| 225 |
+
|
| 226 |
+
def process_batch(self, batch: Dict[str, Any]) -> Dict[str, int]:
|
| 227 |
+
"""
|
| 228 |
+
Process a batch of documents with optimization.
|
| 229 |
+
|
| 230 |
+
Args:
|
| 231 |
+
batch: Batch of documents from ChromaDB
|
| 232 |
+
|
| 233 |
+
Returns:
|
| 234 |
+
Dictionary with processing statistics
|
| 235 |
+
"""
|
| 236 |
+
documents = batch["documents"]
|
| 237 |
+
metadatas = batch["metadatas"]
|
| 238 |
+
embeddings = batch["embeddings"]
|
| 239 |
+
|
| 240 |
+
processed_docs = []
|
| 241 |
+
processed_metadata = []
|
| 242 |
+
processed_embeddings = []
|
| 243 |
+
|
| 244 |
+
stats = {"processed": 0, "skipped": 0, "reembedded": 0}
|
| 245 |
+
|
| 246 |
+
for i, (doc, metadata) in enumerate(zip(documents, metadatas)):
|
| 247 |
+
# Clean and optimize document
|
| 248 |
+
cleaned_metadata = self.optimizer.clean_metadata(metadata or {})
|
| 249 |
+
|
| 250 |
+
# Check if we should include this document
|
| 251 |
+
if not self.optimizer.should_include_document(cleaned_metadata, doc):
|
| 252 |
+
stats["skipped"] += 1
|
| 253 |
+
continue
|
| 254 |
+
|
| 255 |
+
# Summarize document content
|
| 256 |
+
summarized_doc = self.optimizer.summarize_text(doc)
|
| 257 |
+
|
| 258 |
+
# Use existing embedding if available and document wasn't changed much
|
| 259 |
+
if embeddings and i < len(embeddings) and len(doc) == len(summarized_doc):
|
| 260 |
+
# Document unchanged, use existing embedding
|
| 261 |
+
embedding = embeddings[i]
|
| 262 |
+
else:
|
| 263 |
+
# Document changed, need new embedding
|
| 264 |
+
try:
|
| 265 |
+
embedding = self.embedding_service.generate_embeddings(
|
| 266 |
+
[summarized_doc]
|
| 267 |
+
)[0]
|
| 268 |
+
stats["reembedded"] += 1
|
| 269 |
+
except Exception as e:
|
| 270 |
+
logger.warning(
|
| 271 |
+
f"Failed to generate embedding for document {i}: {e}"
|
| 272 |
+
)
|
| 273 |
+
stats["skipped"] += 1
|
| 274 |
+
continue
|
| 275 |
+
|
| 276 |
+
processed_docs.append(summarized_doc)
|
| 277 |
+
processed_metadata.append(cleaned_metadata)
|
| 278 |
+
processed_embeddings.append(embedding)
|
| 279 |
+
stats["processed"] += 1
|
| 280 |
+
|
| 281 |
+
# Add processed documents to PostgreSQL
|
| 282 |
+
if processed_docs:
|
| 283 |
+
try:
|
| 284 |
+
doc_ids = self.postgres_service.add_documents(
|
| 285 |
+
texts=processed_docs,
|
| 286 |
+
embeddings=processed_embeddings,
|
| 287 |
+
metadatas=processed_metadata,
|
| 288 |
+
)
|
| 289 |
+
logger.info(f"Added {len(doc_ids)} documents to PostgreSQL")
|
| 290 |
+
except Exception as e:
|
| 291 |
+
logger.error(f"Failed to add documents to PostgreSQL: {e}")
|
| 292 |
+
raise
|
| 293 |
+
|
| 294 |
+
# Force garbage collection
|
| 295 |
+
gc.collect()
|
| 296 |
+
|
| 297 |
+
return stats
|
| 298 |
+
|
| 299 |
+
def migrate(self) -> Dict[str, int]:
|
| 300 |
+
"""
|
| 301 |
+
Perform the complete migration.
|
| 302 |
+
|
| 303 |
+
Returns:
|
| 304 |
+
Migration statistics
|
| 305 |
+
"""
|
| 306 |
+
logger.info("Starting ChromaDB to PostgreSQL migration...")
|
| 307 |
+
|
| 308 |
+
self.initialize_services()
|
| 309 |
+
|
| 310 |
+
# Clear existing PostgreSQL data
|
| 311 |
+
logger.info("Clearing existing PostgreSQL data...")
|
| 312 |
+
deleted_count = self.postgres_service.delete_all_documents()
|
| 313 |
+
logger.info(f"Deleted {deleted_count} existing documents from PostgreSQL")
|
| 314 |
+
|
| 315 |
+
total_stats = {"processed": 0, "skipped": 0, "reembedded": 0}
|
| 316 |
+
batch_count = 0
|
| 317 |
+
|
| 318 |
+
try:
|
| 319 |
+
# Process documents in batches
|
| 320 |
+
for batch in self.get_chroma_documents():
|
| 321 |
+
batch_count += 1
|
| 322 |
+
logger.info(f"Processing batch {batch_count}...")
|
| 323 |
+
|
| 324 |
+
batch_stats = self.process_batch(batch)
|
| 325 |
+
|
| 326 |
+
# Update totals
|
| 327 |
+
for key in total_stats:
|
| 328 |
+
total_stats[key] += batch_stats[key]
|
| 329 |
+
|
| 330 |
+
logger.info(f"Batch {batch_count} complete: {batch_stats}")
|
| 331 |
+
|
| 332 |
+
# Memory cleanup between batches
|
| 333 |
+
gc.collect()
|
| 334 |
+
|
| 335 |
+
# Final statistics
|
| 336 |
+
logger.info("Migration completed successfully!")
|
| 337 |
+
logger.info(f"Final statistics: {total_stats}")
|
| 338 |
+
|
| 339 |
+
# Verify migration
|
| 340 |
+
postgres_info = self.postgres_service.get_collection_info()
|
| 341 |
+
logger.info(f"PostgreSQL collection info: {postgres_info}")
|
| 342 |
+
|
| 343 |
+
return total_stats
|
| 344 |
+
|
| 345 |
+
except Exception as e:
|
| 346 |
+
logger.error(f"Migration failed: {e}")
|
| 347 |
+
raise
|
| 348 |
+
|
| 349 |
+
def test_migration(self, test_query: str = "policy") -> Dict[str, Any]:
|
| 350 |
+
"""
|
| 351 |
+
Test the migrated data by performing a search.
|
| 352 |
+
|
| 353 |
+
Args:
|
| 354 |
+
test_query: Query to test with
|
| 355 |
+
|
| 356 |
+
Returns:
|
| 357 |
+
Test results
|
| 358 |
+
"""
|
| 359 |
+
logger.info(f"Testing migration with query: '{test_query}'")
|
| 360 |
+
|
| 361 |
+
try:
|
| 362 |
+
# Generate query embedding
|
| 363 |
+
query_embedding = self.embedding_service.generate_embeddings([test_query])[
|
| 364 |
+
0
|
| 365 |
+
]
|
| 366 |
+
|
| 367 |
+
# Search PostgreSQL
|
| 368 |
+
results = self.postgres_service.similarity_search(query_embedding, k=5)
|
| 369 |
+
|
| 370 |
+
logger.info(f"Test search returned {len(results)} results")
|
| 371 |
+
for i, result in enumerate(results):
|
| 372 |
+
logger.info(
|
| 373 |
+
f"Result {i+1}: {result['content'][:100]}... (score: {result.get('similarity_score', 0):.3f})"
|
| 374 |
+
)
|
| 375 |
+
|
| 376 |
+
return {
|
| 377 |
+
"query": test_query,
|
| 378 |
+
"results_count": len(results),
|
| 379 |
+
"results": results,
|
| 380 |
+
}
|
| 381 |
+
|
| 382 |
+
except Exception as e:
|
| 383 |
+
logger.error(f"Migration test failed: {e}")
|
| 384 |
+
return {"error": str(e)}
|
| 385 |
+
|
| 386 |
+
|
| 387 |
+
def main():
|
| 388 |
+
"""Main migration function."""
|
| 389 |
+
import argparse
|
| 390 |
+
|
| 391 |
+
parser = argparse.ArgumentParser(description="Migrate ChromaDB to PostgreSQL")
|
| 392 |
+
parser.add_argument("--database-url", help="PostgreSQL connection URL")
|
| 393 |
+
parser.add_argument(
|
| 394 |
+
"--test-only", action="store_true", help="Only run migration test"
|
| 395 |
+
)
|
| 396 |
+
parser.add_argument(
|
| 397 |
+
"--dry-run",
|
| 398 |
+
action="store_true",
|
| 399 |
+
help="Show what would be migrated without actually migrating",
|
| 400 |
+
)
|
| 401 |
+
|
| 402 |
+
args = parser.parse_args()
|
| 403 |
+
|
| 404 |
+
try:
|
| 405 |
+
migrator = ChromaToPostgresMigrator(database_url=args.database_url)
|
| 406 |
+
|
| 407 |
+
if args.test_only:
|
| 408 |
+
# Only test existing migration
|
| 409 |
+
migrator.initialize_services()
|
| 410 |
+
results = migrator.test_migration()
|
| 411 |
+
print(f"Test results: {results}")
|
| 412 |
+
elif args.dry_run:
|
| 413 |
+
# Show what would be migrated
|
| 414 |
+
migrator.initialize_services()
|
| 415 |
+
total_docs = migrator.chroma_db.get_count()
|
| 416 |
+
logger.info(
|
| 417 |
+
f"Would migrate {total_docs} documents from ChromaDB to PostgreSQL"
|
| 418 |
+
)
|
| 419 |
+
else:
|
| 420 |
+
# Perform actual migration
|
| 421 |
+
stats = migrator.migrate()
|
| 422 |
+
logger.info(f"Migration complete: {stats}")
|
| 423 |
+
|
| 424 |
+
# Test the migration
|
| 425 |
+
test_results = migrator.test_migration()
|
| 426 |
+
logger.info(f"Migration test: {test_results}")
|
| 427 |
+
|
| 428 |
+
except Exception as e:
|
| 429 |
+
logger.error(f"Migration script failed: {e}")
|
| 430 |
+
sys.exit(1)
|
| 431 |
+
|
| 432 |
+
|
| 433 |
+
if __name__ == "__main__":
|
| 434 |
+
main()
|
src/config.py
CHANGED
|
@@ -1,5 +1,7 @@
|
|
| 1 |
"""Configuration settings for the ingestion pipeline"""
|
| 2 |
|
|
|
|
|
|
|
| 3 |
# Default ingestion settings
|
| 4 |
DEFAULT_CHUNK_SIZE = 1000
|
| 5 |
DEFAULT_OVERLAP = 200
|
|
@@ -12,25 +14,42 @@ SUPPORTED_FORMATS = {".txt", ".md", ".markdown"}
|
|
| 12 |
CORPUS_DIRECTORY = "synthetic_policies"
|
| 13 |
|
| 14 |
# Vector Database Settings
|
| 15 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
COLLECTION_NAME = "policy_documents"
|
| 17 |
EMBEDDING_DIMENSION = 384 # paraphrase-MiniLM-L3-v2 (smaller, memory-efficient)
|
| 18 |
SIMILARITY_METRIC = "cosine"
|
| 19 |
|
| 20 |
-
# ChromaDB Configuration for Memory Optimization
|
| 21 |
CHROMA_SETTINGS = {
|
| 22 |
"anonymized_telemetry": False,
|
| 23 |
"allow_reset": False,
|
| 24 |
}
|
| 25 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
# Embedding Model Settings
|
| 27 |
-
|
| 28 |
-
EMBEDDING_MODEL_NAME = (
|
| 29 |
-
"all-MiniLM-L12-v2" # Ultra-lightweight model (384 dim, minimal memory)
|
| 30 |
-
)
|
| 31 |
EMBEDDING_BATCH_SIZE = 1 # Absolute minimum for extreme memory constraints
|
| 32 |
EMBEDDING_DEVICE = "cpu" # Use CPU for free tier compatibility
|
| 33 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
# Search Settings
|
| 35 |
DEFAULT_TOP_K = 5
|
| 36 |
MAX_TOP_K = 20
|
|
|
|
| 1 |
"""Configuration settings for the ingestion pipeline"""
|
| 2 |
|
| 3 |
+
import os
|
| 4 |
+
|
| 5 |
# Default ingestion settings
|
| 6 |
DEFAULT_CHUNK_SIZE = 1000
|
| 7 |
DEFAULT_OVERLAP = 200
|
|
|
|
| 14 |
CORPUS_DIRECTORY = "synthetic_policies"
|
| 15 |
|
| 16 |
# Vector Database Settings
|
| 17 |
+
VECTOR_STORAGE_TYPE = os.getenv(
|
| 18 |
+
"VECTOR_STORAGE_TYPE", "chroma"
|
| 19 |
+
) # "chroma" or "postgres"
|
| 20 |
+
VECTOR_DB_PERSIST_PATH = "data/chroma_db" # Used for ChromaDB
|
| 21 |
+
DATABASE_URL = os.getenv("DATABASE_URL") # Used for PostgreSQL
|
| 22 |
COLLECTION_NAME = "policy_documents"
|
| 23 |
EMBEDDING_DIMENSION = 384 # paraphrase-MiniLM-L3-v2 (smaller, memory-efficient)
|
| 24 |
SIMILARITY_METRIC = "cosine"
|
| 25 |
|
| 26 |
+
# ChromaDB Configuration for Memory Optimization (when using ChromaDB)
|
| 27 |
CHROMA_SETTINGS = {
|
| 28 |
"anonymized_telemetry": False,
|
| 29 |
"allow_reset": False,
|
| 30 |
}
|
| 31 |
|
| 32 |
+
# PostgreSQL Configuration (when using PostgreSQL)
|
| 33 |
+
POSTGRES_TABLE_NAME = "document_embeddings"
|
| 34 |
+
POSTGRES_MAX_CONNECTIONS = 10
|
| 35 |
+
|
| 36 |
# Embedding Model Settings
|
| 37 |
+
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2" # Ultra-lightweight
|
|
|
|
|
|
|
|
|
|
| 38 |
EMBEDDING_BATCH_SIZE = 1 # Absolute minimum for extreme memory constraints
|
| 39 |
EMBEDDING_DEVICE = "cpu" # Use CPU for free tier compatibility
|
| 40 |
|
| 41 |
+
# Document Processing Settings (for memory optimization)
|
| 42 |
+
MAX_DOCUMENT_LENGTH = 1000 # Truncate documents to reduce memory usage
|
| 43 |
+
MAX_DOCUMENTS_IN_MEMORY = 100 # Process documents in small batches
|
| 44 |
+
|
| 45 |
+
# Memory Management Settings
|
| 46 |
+
ENABLE_MEMORY_MONITORING = (
|
| 47 |
+
os.getenv("ENABLE_MEMORY_MONITORING", "true").lower() == "true"
|
| 48 |
+
)
|
| 49 |
+
MEMORY_LIMIT_MB = int(
|
| 50 |
+
os.getenv("MEMORY_LIMIT_MB", "400")
|
| 51 |
+
) # Conservative limit for 512MB instances
|
| 52 |
+
|
| 53 |
# Search Settings
|
| 54 |
DEFAULT_TOP_K = 5
|
| 55 |
MAX_TOP_K = 20
|
src/vector_db/postgres_adapter.py
ADDED
|
@@ -0,0 +1,126 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Adapter to make PostgresVectorService compatible with the existing VectorDatabase interface.
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
import logging
|
| 6 |
+
from typing import Any, Dict, List
|
| 7 |
+
|
| 8 |
+
from src.vector_db.postgres_vector_service import PostgresVectorService
|
| 9 |
+
|
| 10 |
+
logger = logging.getLogger(__name__)
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class PostgresVectorAdapter:
|
| 14 |
+
"""Adapter to make PostgresVectorService compatible with VectorDatabase interface."""
|
| 15 |
+
|
| 16 |
+
def __init__(self, table_name: str = "document_embeddings"):
|
| 17 |
+
"""Initialize the PostgreSQL vector adapter."""
|
| 18 |
+
self.service = PostgresVectorService(table_name=table_name)
|
| 19 |
+
self.collection_name = table_name
|
| 20 |
+
|
| 21 |
+
def add_embeddings_batch(
|
| 22 |
+
self,
|
| 23 |
+
batch_embeddings: List[List[List[float]]],
|
| 24 |
+
batch_chunk_ids: List[List[str]],
|
| 25 |
+
batch_documents: List[List[str]],
|
| 26 |
+
batch_metadatas: List[List[Dict[str, Any]]],
|
| 27 |
+
) -> int:
|
| 28 |
+
"""Add embeddings in batches - compatible with ChromaDB interface."""
|
| 29 |
+
total_added = 0
|
| 30 |
+
|
| 31 |
+
for embeddings, chunk_ids, documents, metadatas in zip(
|
| 32 |
+
batch_embeddings, batch_chunk_ids, batch_documents, batch_metadatas
|
| 33 |
+
):
|
| 34 |
+
added = self.add_embeddings(embeddings, chunk_ids, documents, metadatas)
|
| 35 |
+
if isinstance(added, bool) and added:
|
| 36 |
+
total_added += len(embeddings)
|
| 37 |
+
elif isinstance(added, int):
|
| 38 |
+
total_added += added
|
| 39 |
+
|
| 40 |
+
return total_added
|
| 41 |
+
|
| 42 |
+
def add_embeddings(
|
| 43 |
+
self,
|
| 44 |
+
embeddings: List[List[float]],
|
| 45 |
+
chunk_ids: List[str],
|
| 46 |
+
documents: List[str],
|
| 47 |
+
metadatas: List[Dict[str, Any]],
|
| 48 |
+
) -> bool:
|
| 49 |
+
"""Add embeddings to PostgreSQL - compatible with ChromaDB interface."""
|
| 50 |
+
try:
|
| 51 |
+
doc_ids = self.service.add_documents(documents, embeddings, metadatas)
|
| 52 |
+
return len(doc_ids) == len(embeddings)
|
| 53 |
+
except Exception as e:
|
| 54 |
+
logger.error(f"Failed to add embeddings: {e}")
|
| 55 |
+
raise
|
| 56 |
+
|
| 57 |
+
def search(
|
| 58 |
+
self, query_embedding: List[float], top_k: int = 5
|
| 59 |
+
) -> List[Dict[str, Any]]:
|
| 60 |
+
"""Search for similar embeddings - compatible with ChromaDB interface."""
|
| 61 |
+
try:
|
| 62 |
+
results = self.service.similarity_search(query_embedding, k=top_k)
|
| 63 |
+
|
| 64 |
+
# Convert PostgreSQL results to ChromaDB-compatible format
|
| 65 |
+
formatted_results = []
|
| 66 |
+
for i, result in enumerate(results):
|
| 67 |
+
formatted_result = {
|
| 68 |
+
"id": result["id"],
|
| 69 |
+
"document": result["content"],
|
| 70 |
+
"metadata": result["metadata"],
|
| 71 |
+
"distance": 1.0
|
| 72 |
+
- result.get(
|
| 73 |
+
"similarity_score", 0.0
|
| 74 |
+
), # Convert similarity to distance
|
| 75 |
+
}
|
| 76 |
+
formatted_results.append(formatted_result)
|
| 77 |
+
|
| 78 |
+
return formatted_results
|
| 79 |
+
|
| 80 |
+
except Exception as e:
|
| 81 |
+
logger.error(f"Search failed: {e}")
|
| 82 |
+
return []
|
| 83 |
+
|
| 84 |
+
def get_count(self) -> int:
|
| 85 |
+
"""Get the number of embeddings in the collection."""
|
| 86 |
+
try:
|
| 87 |
+
info = self.service.get_collection_info()
|
| 88 |
+
return info.get("document_count", 0)
|
| 89 |
+
except Exception as e:
|
| 90 |
+
logger.error(f"Failed to get count: {e}")
|
| 91 |
+
return 0
|
| 92 |
+
|
| 93 |
+
def delete_collection(self) -> bool:
|
| 94 |
+
"""Delete all documents from the collection."""
|
| 95 |
+
try:
|
| 96 |
+
deleted_count = self.service.delete_all_documents()
|
| 97 |
+
return deleted_count >= 0
|
| 98 |
+
except Exception as e:
|
| 99 |
+
logger.error(f"Failed to delete collection: {e}")
|
| 100 |
+
return False
|
| 101 |
+
|
| 102 |
+
def reset_collection(self) -> bool:
|
| 103 |
+
"""Reset the collection (delete all documents)."""
|
| 104 |
+
return self.delete_collection()
|
| 105 |
+
|
| 106 |
+
def get_collection(self):
|
| 107 |
+
"""Get the underlying service (for compatibility)."""
|
| 108 |
+
return self.service
|
| 109 |
+
|
| 110 |
+
def get_embedding_dimension(self) -> int:
|
| 111 |
+
"""Get the embedding dimension."""
|
| 112 |
+
try:
|
| 113 |
+
info = self.service.get_collection_info()
|
| 114 |
+
return info.get("embedding_dimension", 0) or 0
|
| 115 |
+
except Exception as e:
|
| 116 |
+
logger.error(f"Failed to get embedding dimension: {e}")
|
| 117 |
+
return 0
|
| 118 |
+
|
| 119 |
+
def has_valid_embeddings(self, expected_dimension: int) -> bool:
|
| 120 |
+
"""Check if the collection has embeddings with the expected dimension."""
|
| 121 |
+
try:
|
| 122 |
+
actual_dimension = self.get_embedding_dimension()
|
| 123 |
+
return actual_dimension == expected_dimension and actual_dimension > 0
|
| 124 |
+
except Exception as e:
|
| 125 |
+
logger.error(f"Failed to validate embeddings: {e}")
|
| 126 |
+
return False
|
src/vector_db/postgres_vector_service.py
ADDED
|
@@ -0,0 +1,473 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
PostgreSQL vector database service using pgvector extension.
|
| 3 |
+
This service provides persistent vector storage with efficient similarity search.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import json
|
| 7 |
+
import logging
|
| 8 |
+
import os
|
| 9 |
+
from contextlib import contextmanager
|
| 10 |
+
from typing import Any, Dict, List, Optional
|
| 11 |
+
|
| 12 |
+
import numpy as np
|
| 13 |
+
import psycopg2
|
| 14 |
+
import psycopg2.extras
|
| 15 |
+
|
| 16 |
+
logger = logging.getLogger(__name__)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
class PostgresVectorService:
|
| 20 |
+
"""Vector database service using PostgreSQL with pgvector extension."""
|
| 21 |
+
|
| 22 |
+
def __init__(
|
| 23 |
+
self,
|
| 24 |
+
connection_string: Optional[str] = None,
|
| 25 |
+
table_name: str = "document_embeddings",
|
| 26 |
+
):
|
| 27 |
+
"""
|
| 28 |
+
Initialize PostgreSQL vector service.
|
| 29 |
+
|
| 30 |
+
Args:
|
| 31 |
+
connection_string: PostgreSQL connection string. If None, uses DATABASE_URL env var.
|
| 32 |
+
table_name: Name of the table to store embeddings.
|
| 33 |
+
"""
|
| 34 |
+
self.connection_string = connection_string or os.getenv("DATABASE_URL")
|
| 35 |
+
if not self.connection_string:
|
| 36 |
+
raise ValueError("DATABASE_URL environment variable is required")
|
| 37 |
+
|
| 38 |
+
self.table_name = table_name
|
| 39 |
+
self.dimension = None # Will be set based on first embedding
|
| 40 |
+
|
| 41 |
+
# Test connection and create table
|
| 42 |
+
self._initialize_database()
|
| 43 |
+
|
| 44 |
+
@contextmanager
|
| 45 |
+
def _get_connection(self):
|
| 46 |
+
"""Context manager for database connections."""
|
| 47 |
+
conn = None
|
| 48 |
+
try:
|
| 49 |
+
conn = psycopg2.connect(self.connection_string)
|
| 50 |
+
yield conn
|
| 51 |
+
except Exception as e:
|
| 52 |
+
if conn:
|
| 53 |
+
conn.rollback()
|
| 54 |
+
logger.error(f"Database connection error: {e}")
|
| 55 |
+
raise
|
| 56 |
+
finally:
|
| 57 |
+
if conn:
|
| 58 |
+
conn.close()
|
| 59 |
+
|
| 60 |
+
def _initialize_database(self):
|
| 61 |
+
"""Initialize database with required extensions and tables."""
|
| 62 |
+
with self._get_connection() as conn:
|
| 63 |
+
with conn.cursor() as cur:
|
| 64 |
+
# Enable pgvector extension
|
| 65 |
+
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
|
| 66 |
+
|
| 67 |
+
# Create table with initial structure (dimension will be added later)
|
| 68 |
+
cur.execute(
|
| 69 |
+
f"""
|
| 70 |
+
CREATE TABLE IF NOT EXISTS {self.table_name} (
|
| 71 |
+
id SERIAL PRIMARY KEY,
|
| 72 |
+
content TEXT NOT NULL,
|
| 73 |
+
embedding vector,
|
| 74 |
+
metadata JSONB DEFAULT '{{}}',
|
| 75 |
+
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
| 76 |
+
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
| 77 |
+
);
|
| 78 |
+
"""
|
| 79 |
+
)
|
| 80 |
+
|
| 81 |
+
# Create index for text search
|
| 82 |
+
cur.execute(
|
| 83 |
+
f"""
|
| 84 |
+
CREATE INDEX IF NOT EXISTS idx_{self.table_name}_content
|
| 85 |
+
ON {self.table_name} USING gin(to_tsvector('english', content));
|
| 86 |
+
"""
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
+
conn.commit()
|
| 90 |
+
logger.info(f"Database initialized with table: {self.table_name}")
|
| 91 |
+
|
| 92 |
+
def _ensure_embedding_dimension(self, dimension: int):
|
| 93 |
+
"""Ensure the embedding column has the correct dimension."""
|
| 94 |
+
if self.dimension == dimension:
|
| 95 |
+
return
|
| 96 |
+
|
| 97 |
+
with self._get_connection() as conn:
|
| 98 |
+
with conn.cursor() as cur:
|
| 99 |
+
# Check if we need to alter the table
|
| 100 |
+
cur.execute(
|
| 101 |
+
f"""
|
| 102 |
+
SELECT column_name, data_type, character_maximum_length
|
| 103 |
+
FROM information_schema.columns
|
| 104 |
+
WHERE table_name = %s AND column_name = 'embedding';
|
| 105 |
+
""",
|
| 106 |
+
(self.table_name,),
|
| 107 |
+
)
|
| 108 |
+
|
| 109 |
+
result = cur.fetchone()
|
| 110 |
+
if result and f"vector({dimension})" not in str(result):
|
| 111 |
+
# Drop existing index if it exists
|
| 112 |
+
cur.execute(
|
| 113 |
+
f"DROP INDEX IF EXISTS idx_{self.table_name}_embedding_cosine;"
|
| 114 |
+
)
|
| 115 |
+
|
| 116 |
+
# Alter column to correct dimension
|
| 117 |
+
cur.execute(
|
| 118 |
+
f"ALTER TABLE {self.table_name} ALTER COLUMN embedding TYPE vector({dimension});"
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
# Create optimized index for similarity search
|
| 122 |
+
cur.execute(
|
| 123 |
+
f"""
|
| 124 |
+
CREATE INDEX IF NOT EXISTS idx_{self.table_name}_embedding_cosine
|
| 125 |
+
ON {self.table_name}
|
| 126 |
+
USING ivfflat (embedding vector_cosine_ops)
|
| 127 |
+
WITH (lists = 100);
|
| 128 |
+
"""
|
| 129 |
+
)
|
| 130 |
+
|
| 131 |
+
conn.commit()
|
| 132 |
+
logger.info(f"Updated embedding dimension to {dimension}")
|
| 133 |
+
|
| 134 |
+
self.dimension = dimension
|
| 135 |
+
|
| 136 |
+
def add_documents(
|
| 137 |
+
self,
|
| 138 |
+
texts: List[str],
|
| 139 |
+
embeddings: List[List[float]],
|
| 140 |
+
metadatas: Optional[List[Dict[str, Any]]] = None,
|
| 141 |
+
) -> List[str]:
|
| 142 |
+
"""
|
| 143 |
+
Add documents with their embeddings to the database.
|
| 144 |
+
|
| 145 |
+
Args:
|
| 146 |
+
texts: List of document texts
|
| 147 |
+
embeddings: List of embedding vectors
|
| 148 |
+
metadatas: Optional list of metadata dictionaries
|
| 149 |
+
|
| 150 |
+
Returns:
|
| 151 |
+
List of document IDs
|
| 152 |
+
"""
|
| 153 |
+
if not texts or not embeddings:
|
| 154 |
+
return []
|
| 155 |
+
|
| 156 |
+
if len(texts) != len(embeddings):
|
| 157 |
+
raise ValueError("Number of texts must match number of embeddings")
|
| 158 |
+
|
| 159 |
+
if metadatas and len(metadatas) != len(texts):
|
| 160 |
+
raise ValueError("Number of metadatas must match number of texts")
|
| 161 |
+
|
| 162 |
+
# Ensure embedding dimension is set
|
| 163 |
+
if embeddings:
|
| 164 |
+
self._ensure_embedding_dimension(len(embeddings[0]))
|
| 165 |
+
|
| 166 |
+
# Default empty metadata if not provided
|
| 167 |
+
if metadatas is None:
|
| 168 |
+
metadatas = [{}] * len(texts)
|
| 169 |
+
|
| 170 |
+
document_ids = []
|
| 171 |
+
|
| 172 |
+
with self._get_connection() as conn:
|
| 173 |
+
with conn.cursor() as cur:
|
| 174 |
+
for text, embedding, metadata in zip(texts, embeddings, metadatas):
|
| 175 |
+
# Insert document and get ID
|
| 176 |
+
cur.execute(
|
| 177 |
+
f"""
|
| 178 |
+
INSERT INTO {self.table_name} (content, embedding, metadata)
|
| 179 |
+
VALUES (%s, %s, %s)
|
| 180 |
+
RETURNING id;
|
| 181 |
+
""",
|
| 182 |
+
(text, embedding, psycopg2.extras.Json(metadata)),
|
| 183 |
+
)
|
| 184 |
+
|
| 185 |
+
doc_id = cur.fetchone()[0]
|
| 186 |
+
document_ids.append(str(doc_id))
|
| 187 |
+
|
| 188 |
+
conn.commit()
|
| 189 |
+
logger.info(f"Added {len(document_ids)} documents to database")
|
| 190 |
+
|
| 191 |
+
return document_ids
|
| 192 |
+
|
| 193 |
+
def similarity_search(
|
| 194 |
+
self,
|
| 195 |
+
query_embedding: List[float],
|
| 196 |
+
k: int = 5,
|
| 197 |
+
filter_metadata: Optional[Dict[str, Any]] = None,
|
| 198 |
+
) -> List[Dict]:
|
| 199 |
+
"""
|
| 200 |
+
Perform similarity search using cosine distance.
|
| 201 |
+
|
| 202 |
+
Args:
|
| 203 |
+
query_embedding: Query embedding vector
|
| 204 |
+
k: Number of results to return
|
| 205 |
+
filter_metadata: Optional metadata filters
|
| 206 |
+
|
| 207 |
+
Returns:
|
| 208 |
+
List of documents with similarity scores
|
| 209 |
+
"""
|
| 210 |
+
if not query_embedding:
|
| 211 |
+
return []
|
| 212 |
+
|
| 213 |
+
# Build WHERE clause for metadata filtering
|
| 214 |
+
where_clause = ""
|
| 215 |
+
params = [query_embedding, query_embedding, k]
|
| 216 |
+
|
| 217 |
+
if filter_metadata:
|
| 218 |
+
conditions = []
|
| 219 |
+
for key, value in filter_metadata.items():
|
| 220 |
+
if isinstance(value, str):
|
| 221 |
+
conditions.append(f"metadata->>%s = %s")
|
| 222 |
+
params.insert(-1, key)
|
| 223 |
+
params.insert(-1, value)
|
| 224 |
+
elif isinstance(value, (int, float)):
|
| 225 |
+
conditions.append(f"(metadata->>%s)::numeric = %s")
|
| 226 |
+
params.insert(-1, key)
|
| 227 |
+
params.insert(-1, value)
|
| 228 |
+
|
| 229 |
+
if conditions:
|
| 230 |
+
where_clause = "WHERE " + " AND ".join(conditions)
|
| 231 |
+
|
| 232 |
+
query = f"""
|
| 233 |
+
SELECT id, content, metadata,
|
| 234 |
+
1 - (embedding <=> %s) as similarity_score
|
| 235 |
+
FROM {self.table_name}
|
| 236 |
+
{where_clause}
|
| 237 |
+
ORDER BY embedding <=> %s
|
| 238 |
+
LIMIT %s;
|
| 239 |
+
"""
|
| 240 |
+
|
| 241 |
+
with self._get_connection() as conn:
|
| 242 |
+
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
| 243 |
+
cur.execute(query, params)
|
| 244 |
+
results = cur.fetchall()
|
| 245 |
+
|
| 246 |
+
return [
|
| 247 |
+
{
|
| 248 |
+
"id": str(row["id"]),
|
| 249 |
+
"content": row["content"],
|
| 250 |
+
"metadata": row["metadata"] or {},
|
| 251 |
+
"similarity_score": float(row["similarity_score"]),
|
| 252 |
+
}
|
| 253 |
+
for row in results
|
| 254 |
+
]
|
| 255 |
+
|
| 256 |
+
def get_collection_info(self) -> Dict[str, Any]:
|
| 257 |
+
"""Get information about the vector collection."""
|
| 258 |
+
with self._get_connection() as conn:
|
| 259 |
+
with conn.cursor() as cur:
|
| 260 |
+
# Get document count
|
| 261 |
+
cur.execute(f"SELECT COUNT(*) FROM {self.table_name}")
|
| 262 |
+
doc_count = cur.fetchone()[0]
|
| 263 |
+
|
| 264 |
+
# Get table size
|
| 265 |
+
cur.execute(
|
| 266 |
+
f"""
|
| 267 |
+
SELECT pg_size_pretty(pg_total_relation_size(%s)) as size;
|
| 268 |
+
""",
|
| 269 |
+
(self.table_name,),
|
| 270 |
+
)
|
| 271 |
+
table_size = cur.fetchone()[0]
|
| 272 |
+
|
| 273 |
+
# Get dimension info
|
| 274 |
+
cur.execute(
|
| 275 |
+
f"""
|
| 276 |
+
SELECT column_name, data_type
|
| 277 |
+
FROM information_schema.columns
|
| 278 |
+
WHERE table_name = %s AND column_name = 'embedding';
|
| 279 |
+
""",
|
| 280 |
+
(self.table_name,),
|
| 281 |
+
)
|
| 282 |
+
embedding_info = cur.fetchone()
|
| 283 |
+
|
| 284 |
+
return {
|
| 285 |
+
"document_count": doc_count,
|
| 286 |
+
"table_size": table_size,
|
| 287 |
+
"embedding_dimension": self.dimension,
|
| 288 |
+
"table_name": self.table_name,
|
| 289 |
+
"embedding_column_type": (
|
| 290 |
+
embedding_info[1] if embedding_info else None
|
| 291 |
+
),
|
| 292 |
+
}
|
| 293 |
+
|
| 294 |
+
def delete_documents(self, document_ids: List[str]) -> int:
|
| 295 |
+
"""
|
| 296 |
+
Delete documents by their IDs.
|
| 297 |
+
|
| 298 |
+
Args:
|
| 299 |
+
document_ids: List of document IDs to delete
|
| 300 |
+
|
| 301 |
+
Returns:
|
| 302 |
+
Number of documents deleted
|
| 303 |
+
"""
|
| 304 |
+
if not document_ids:
|
| 305 |
+
return 0
|
| 306 |
+
|
| 307 |
+
with self._get_connection() as conn:
|
| 308 |
+
with conn.cursor() as cur:
|
| 309 |
+
# Convert string IDs to integers
|
| 310 |
+
int_ids = [int(doc_id) for doc_id in document_ids]
|
| 311 |
+
|
| 312 |
+
cur.execute(
|
| 313 |
+
f"""
|
| 314 |
+
DELETE FROM {self.table_name}
|
| 315 |
+
WHERE id = ANY(%s)
|
| 316 |
+
""",
|
| 317 |
+
(int_ids,),
|
| 318 |
+
)
|
| 319 |
+
|
| 320 |
+
deleted_count = cur.rowcount
|
| 321 |
+
conn.commit()
|
| 322 |
+
|
| 323 |
+
logger.info(f"Deleted {deleted_count} documents")
|
| 324 |
+
return deleted_count
|
| 325 |
+
|
| 326 |
+
def delete_all_documents(self) -> int:
|
| 327 |
+
"""
|
| 328 |
+
Delete all documents from the collection.
|
| 329 |
+
|
| 330 |
+
Returns:
|
| 331 |
+
Number of documents deleted
|
| 332 |
+
"""
|
| 333 |
+
with self._get_connection() as conn:
|
| 334 |
+
with conn.cursor() as cur:
|
| 335 |
+
cur.execute(f"SELECT COUNT(*) FROM {self.table_name}")
|
| 336 |
+
count_before = cur.fetchone()[0]
|
| 337 |
+
|
| 338 |
+
cur.execute(f"DELETE FROM {self.table_name}")
|
| 339 |
+
|
| 340 |
+
# Reset the sequence
|
| 341 |
+
cur.execute(f"ALTER SEQUENCE {self.table_name}_id_seq RESTART WITH 1")
|
| 342 |
+
|
| 343 |
+
conn.commit()
|
| 344 |
+
logger.info(f"Deleted all {count_before} documents")
|
| 345 |
+
return count_before
|
| 346 |
+
|
| 347 |
+
def update_document(
|
| 348 |
+
self,
|
| 349 |
+
document_id: str,
|
| 350 |
+
content: Optional[str] = None,
|
| 351 |
+
embedding: Optional[List[float]] = None,
|
| 352 |
+
metadata: Optional[Dict[str, Any]] = None,
|
| 353 |
+
) -> bool:
|
| 354 |
+
"""
|
| 355 |
+
Update a document's content, embedding, or metadata.
|
| 356 |
+
|
| 357 |
+
Args:
|
| 358 |
+
document_id: ID of document to update
|
| 359 |
+
content: New content (optional)
|
| 360 |
+
embedding: New embedding (optional)
|
| 361 |
+
metadata: New metadata (optional)
|
| 362 |
+
|
| 363 |
+
Returns:
|
| 364 |
+
True if document was updated, False if not found
|
| 365 |
+
"""
|
| 366 |
+
if not any([content, embedding, metadata]):
|
| 367 |
+
return False
|
| 368 |
+
|
| 369 |
+
updates = []
|
| 370 |
+
params = []
|
| 371 |
+
|
| 372 |
+
if content is not None:
|
| 373 |
+
updates.append("content = %s")
|
| 374 |
+
params.append(content)
|
| 375 |
+
|
| 376 |
+
if embedding is not None:
|
| 377 |
+
updates.append("embedding = %s")
|
| 378 |
+
params.append(embedding)
|
| 379 |
+
|
| 380 |
+
if metadata is not None:
|
| 381 |
+
updates.append("metadata = %s")
|
| 382 |
+
params.append(psycopg2.extras.Json(metadata))
|
| 383 |
+
|
| 384 |
+
updates.append("updated_at = CURRENT_TIMESTAMP")
|
| 385 |
+
params.append(int(document_id))
|
| 386 |
+
|
| 387 |
+
query = f"""
|
| 388 |
+
UPDATE {self.table_name}
|
| 389 |
+
SET {', '.join(updates)}
|
| 390 |
+
WHERE id = %s
|
| 391 |
+
"""
|
| 392 |
+
|
| 393 |
+
with self._get_connection() as conn:
|
| 394 |
+
with conn.cursor() as cur:
|
| 395 |
+
cur.execute(query, params)
|
| 396 |
+
updated = cur.rowcount > 0
|
| 397 |
+
conn.commit()
|
| 398 |
+
|
| 399 |
+
if updated:
|
| 400 |
+
logger.info(f"Updated document {document_id}")
|
| 401 |
+
else:
|
| 402 |
+
logger.warning(f"Document {document_id} not found for update")
|
| 403 |
+
|
| 404 |
+
return updated
|
| 405 |
+
|
| 406 |
+
def get_document(self, document_id: str) -> Optional[Dict[str, Any]]:
|
| 407 |
+
"""
|
| 408 |
+
Get a single document by ID.
|
| 409 |
+
|
| 410 |
+
Args:
|
| 411 |
+
document_id: ID of document to retrieve
|
| 412 |
+
|
| 413 |
+
Returns:
|
| 414 |
+
Document dictionary or None if not found
|
| 415 |
+
"""
|
| 416 |
+
with self._get_connection() as conn:
|
| 417 |
+
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
| 418 |
+
cur.execute(
|
| 419 |
+
f"""
|
| 420 |
+
SELECT id, content, metadata, created_at, updated_at
|
| 421 |
+
FROM {self.table_name}
|
| 422 |
+
WHERE id = %s
|
| 423 |
+
""",
|
| 424 |
+
(int(document_id),),
|
| 425 |
+
)
|
| 426 |
+
|
| 427 |
+
row = cur.fetchone()
|
| 428 |
+
if row:
|
| 429 |
+
return {
|
| 430 |
+
"id": str(row["id"]),
|
| 431 |
+
"content": row["content"],
|
| 432 |
+
"metadata": row["metadata"] or {},
|
| 433 |
+
"created_at": (
|
| 434 |
+
row["created_at"].isoformat() if row["created_at"] else None
|
| 435 |
+
),
|
| 436 |
+
"updated_at": (
|
| 437 |
+
row["updated_at"].isoformat() if row["updated_at"] else None
|
| 438 |
+
),
|
| 439 |
+
}
|
| 440 |
+
return None
|
| 441 |
+
|
| 442 |
+
def health_check(self) -> Dict[str, Any]:
|
| 443 |
+
"""
|
| 444 |
+
Check the health of the vector database service.
|
| 445 |
+
|
| 446 |
+
Returns:
|
| 447 |
+
Health status dictionary
|
| 448 |
+
"""
|
| 449 |
+
try:
|
| 450 |
+
with self._get_connection() as conn:
|
| 451 |
+
with conn.cursor() as cur:
|
| 452 |
+
# Test basic connectivity
|
| 453 |
+
cur.execute("SELECT 1")
|
| 454 |
+
|
| 455 |
+
# Check if pgvector extension is installed
|
| 456 |
+
cur.execute(
|
| 457 |
+
"SELECT EXISTS(SELECT 1 FROM pg_extension WHERE extname = 'vector')"
|
| 458 |
+
)
|
| 459 |
+
pgvector_installed = cur.fetchone()[0]
|
| 460 |
+
|
| 461 |
+
# Get basic stats
|
| 462 |
+
info = self.get_collection_info()
|
| 463 |
+
|
| 464 |
+
return {
|
| 465 |
+
"status": "healthy",
|
| 466 |
+
"pgvector_installed": pgvector_installed,
|
| 467 |
+
"connection": "ok",
|
| 468 |
+
"collection_info": info,
|
| 469 |
+
}
|
| 470 |
+
|
| 471 |
+
except Exception as e:
|
| 472 |
+
logger.error(f"Health check failed: {e}")
|
| 473 |
+
return {"status": "unhealthy", "error": str(e), "connection": "failed"}
|
src/vector_store/vector_db.py
CHANGED
|
@@ -1,12 +1,42 @@
|
|
| 1 |
import logging
|
| 2 |
from pathlib import Path
|
| 3 |
-
from typing import Any, Dict, List
|
| 4 |
|
| 5 |
import chromadb
|
| 6 |
|
|
|
|
| 7 |
from src.utils.memory_utils import log_memory_checkpoint, memory_monitor
|
| 8 |
|
| 9 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
class VectorDatabase:
|
| 11 |
"""ChromaDB integration for vector storage and similarity search"""
|
| 12 |
|
|
|
|
| 1 |
import logging
|
| 2 |
from pathlib import Path
|
| 3 |
+
from typing import Any, Dict, List, Optional, Protocol, Union
|
| 4 |
|
| 5 |
import chromadb
|
| 6 |
|
| 7 |
+
from src.config import VECTOR_STORAGE_TYPE
|
| 8 |
from src.utils.memory_utils import log_memory_checkpoint, memory_monitor
|
| 9 |
|
| 10 |
|
| 11 |
+
def create_vector_database(
|
| 12 |
+
persist_path: Optional[str] = None, collection_name: Optional[str] = None
|
| 13 |
+
):
|
| 14 |
+
"""
|
| 15 |
+
Factory function to create the appropriate vector database implementation.
|
| 16 |
+
|
| 17 |
+
Args:
|
| 18 |
+
persist_path: Path for persistence (used by ChromaDB)
|
| 19 |
+
collection_name: Name of the collection
|
| 20 |
+
|
| 21 |
+
Returns:
|
| 22 |
+
Vector database implementation
|
| 23 |
+
"""
|
| 24 |
+
if VECTOR_STORAGE_TYPE == "postgres":
|
| 25 |
+
from src.vector_db.postgres_adapter import PostgresVectorAdapter
|
| 26 |
+
|
| 27 |
+
return PostgresVectorAdapter(
|
| 28 |
+
table_name=collection_name or "document_embeddings"
|
| 29 |
+
)
|
| 30 |
+
else:
|
| 31 |
+
# Default to ChromaDB
|
| 32 |
+
from src.config import COLLECTION_NAME, VECTOR_DB_PERSIST_PATH
|
| 33 |
+
|
| 34 |
+
return VectorDatabase(
|
| 35 |
+
persist_path=persist_path or VECTOR_DB_PERSIST_PATH,
|
| 36 |
+
collection_name=collection_name or COLLECTION_NAME,
|
| 37 |
+
)
|
| 38 |
+
|
| 39 |
+
|
| 40 |
class VectorDatabase:
|
| 41 |
"""ChromaDB integration for vector storage and similarity search"""
|
| 42 |
|
tests/test_embedding/test_embedding_service.py
CHANGED
|
@@ -1,13 +1,14 @@
|
|
|
|
|
|
|
|
| 1 |
from src.embedding.embedding_service import EmbeddingService
|
| 2 |
|
| 3 |
|
| 4 |
def test_embedding_service_initialization():
|
| 5 |
"""Test EmbeddingService initialization"""
|
| 6 |
-
# Test will fail initially - we'll implement EmbeddingService to make it pass
|
| 7 |
service = EmbeddingService()
|
| 8 |
|
| 9 |
assert service is not None
|
| 10 |
-
assert service.model_name == "all-MiniLM-
|
| 11 |
assert service.device == "cpu"
|
| 12 |
|
| 13 |
|
|
@@ -171,12 +172,12 @@ def test_similarity_makes_sense():
|
|
| 171 |
embed3 = service.embed_text(text3)
|
| 172 |
|
| 173 |
# Calculate simple cosine similarity (for validation)
|
| 174 |
-
def cosine_similarity(a, b):
|
| 175 |
import numpy as np
|
| 176 |
|
| 177 |
a_np = np.array(a)
|
| 178 |
b_np = np.array(b)
|
| 179 |
-
return np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
|
| 180 |
|
| 181 |
sim_1_2 = cosine_similarity(embed1, embed2) # Similar texts
|
| 182 |
sim_1_3 = cosine_similarity(embed1, embed3) # Different texts
|
|
|
|
| 1 |
+
from typing import List
|
| 2 |
+
|
| 3 |
from src.embedding.embedding_service import EmbeddingService
|
| 4 |
|
| 5 |
|
| 6 |
def test_embedding_service_initialization():
|
| 7 |
"""Test EmbeddingService initialization"""
|
|
|
|
| 8 |
service = EmbeddingService()
|
| 9 |
|
| 10 |
assert service is not None
|
| 11 |
+
assert service.model_name == "sentence-transformers/all-MiniLM-L6-v2"
|
| 12 |
assert service.device == "cpu"
|
| 13 |
|
| 14 |
|
|
|
|
| 172 |
embed3 = service.embed_text(text3)
|
| 173 |
|
| 174 |
# Calculate simple cosine similarity (for validation)
|
| 175 |
+
def cosine_similarity(a: List[float], b: List[float]) -> float:
|
| 176 |
import numpy as np
|
| 177 |
|
| 178 |
a_np = np.array(a)
|
| 179 |
b_np = np.array(b)
|
| 180 |
+
return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))
|
| 181 |
|
| 182 |
sim_1_2 = cosine_similarity(embed1, embed2) # Similar texts
|
| 183 |
sim_1_3 = cosine_similarity(embed1, embed3) # Different texts
|
tests/test_vector_store/test_postgres_vector.py
ADDED
|
@@ -0,0 +1,366 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for PostgresVectorService and PostgresVectorAdapter.
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
import os
|
| 6 |
+
from typing import Any, Dict, List
|
| 7 |
+
from unittest.mock import MagicMock, Mock, patch
|
| 8 |
+
|
| 9 |
+
import pytest
|
| 10 |
+
|
| 11 |
+
from src.vector_db.postgres_adapter import PostgresVectorAdapter
|
| 12 |
+
from src.vector_db.postgres_vector_service import PostgresVectorService
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
class TestPostgresVectorService:
|
| 16 |
+
"""Test PostgresVectorService functionality."""
|
| 17 |
+
|
| 18 |
+
def setup_method(self):
|
| 19 |
+
"""Setup test fixtures."""
|
| 20 |
+
self.test_connection_string = "postgresql://test:test@localhost:5432/test_db"
|
| 21 |
+
self.test_table_name = "test_embeddings"
|
| 22 |
+
|
| 23 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 24 |
+
def test_initialization(self, mock_connect):
|
| 25 |
+
"""Test service initialization."""
|
| 26 |
+
mock_conn = Mock()
|
| 27 |
+
mock_cursor = Mock()
|
| 28 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 29 |
+
mock_connect.return_value = mock_conn
|
| 30 |
+
|
| 31 |
+
service = PostgresVectorService(
|
| 32 |
+
connection_string=self.test_connection_string,
|
| 33 |
+
table_name=self.test_table_name,
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
assert service.connection_string == self.test_connection_string
|
| 37 |
+
assert service.table_name == self.test_table_name
|
| 38 |
+
|
| 39 |
+
# Verify initialization queries were called
|
| 40 |
+
mock_cursor.execute.assert_any_call("CREATE EXTENSION IF NOT EXISTS vector;")
|
| 41 |
+
|
| 42 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 43 |
+
def test_add_documents(self, mock_connect):
|
| 44 |
+
"""Test adding documents."""
|
| 45 |
+
mock_conn = Mock()
|
| 46 |
+
mock_cursor = Mock()
|
| 47 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 48 |
+
mock_cursor.fetchone.return_value = [1] # Mock returned ID
|
| 49 |
+
mock_connect.return_value = mock_conn
|
| 50 |
+
|
| 51 |
+
service = PostgresVectorService(
|
| 52 |
+
connection_string=self.test_connection_string,
|
| 53 |
+
table_name=self.test_table_name,
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
texts = ["test document 1", "test document 2"]
|
| 57 |
+
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
|
| 58 |
+
metadatas = [{"source": "test1"}, {"source": "test2"}]
|
| 59 |
+
|
| 60 |
+
doc_ids = service.add_documents(texts, embeddings, metadatas)
|
| 61 |
+
|
| 62 |
+
assert len(doc_ids) == 2
|
| 63 |
+
assert all(isinstance(doc_id, str) for doc_id in doc_ids)
|
| 64 |
+
|
| 65 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 66 |
+
def test_similarity_search(self, mock_connect):
|
| 67 |
+
"""Test similarity search."""
|
| 68 |
+
mock_conn = Mock()
|
| 69 |
+
mock_cursor = Mock()
|
| 70 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 71 |
+
|
| 72 |
+
# Mock search results
|
| 73 |
+
mock_cursor.fetchall.return_value = [
|
| 74 |
+
{
|
| 75 |
+
"id": 1,
|
| 76 |
+
"content": "test document",
|
| 77 |
+
"metadata": {"source": "test"},
|
| 78 |
+
"similarity_score": 0.85,
|
| 79 |
+
}
|
| 80 |
+
]
|
| 81 |
+
|
| 82 |
+
mock_connect.return_value = mock_conn
|
| 83 |
+
|
| 84 |
+
service = PostgresVectorService(
|
| 85 |
+
connection_string=self.test_connection_string,
|
| 86 |
+
table_name=self.test_table_name,
|
| 87 |
+
)
|
| 88 |
+
|
| 89 |
+
query_embedding = [0.1, 0.2, 0.3]
|
| 90 |
+
results = service.similarity_search(query_embedding, k=5)
|
| 91 |
+
|
| 92 |
+
assert len(results) == 1
|
| 93 |
+
assert results[0]["id"] == "1"
|
| 94 |
+
assert results[0]["content"] == "test document"
|
| 95 |
+
assert "similarity_score" in results[0]
|
| 96 |
+
|
| 97 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 98 |
+
def test_get_collection_info(self, mock_connect):
|
| 99 |
+
"""Test getting collection information."""
|
| 100 |
+
mock_conn = Mock()
|
| 101 |
+
mock_cursor = Mock()
|
| 102 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 103 |
+
|
| 104 |
+
# Mock collection info queries
|
| 105 |
+
mock_cursor.fetchone.side_effect = [
|
| 106 |
+
[100], # document count
|
| 107 |
+
["1.2 MB"], # table size
|
| 108 |
+
["embedding", "vector(384)"], # column info
|
| 109 |
+
]
|
| 110 |
+
|
| 111 |
+
mock_connect.return_value = mock_conn
|
| 112 |
+
|
| 113 |
+
service = PostgresVectorService(
|
| 114 |
+
connection_string=self.test_connection_string,
|
| 115 |
+
table_name=self.test_table_name,
|
| 116 |
+
)
|
| 117 |
+
service.dimension = 384 # Set dimension
|
| 118 |
+
|
| 119 |
+
info = service.get_collection_info()
|
| 120 |
+
|
| 121 |
+
assert info["document_count"] == 100
|
| 122 |
+
assert info["table_size"] == "1.2 MB"
|
| 123 |
+
assert info["embedding_dimension"] == 384
|
| 124 |
+
|
| 125 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 126 |
+
def test_delete_documents(self, mock_connect):
|
| 127 |
+
"""Test deleting specific documents."""
|
| 128 |
+
mock_conn = Mock()
|
| 129 |
+
mock_cursor = Mock()
|
| 130 |
+
mock_cursor.rowcount = 2
|
| 131 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 132 |
+
mock_connect.return_value = mock_conn
|
| 133 |
+
|
| 134 |
+
service = PostgresVectorService(
|
| 135 |
+
connection_string=self.test_connection_string,
|
| 136 |
+
table_name=self.test_table_name,
|
| 137 |
+
)
|
| 138 |
+
|
| 139 |
+
deleted_count = service.delete_documents(["1", "2"])
|
| 140 |
+
|
| 141 |
+
assert deleted_count == 2
|
| 142 |
+
|
| 143 |
+
@patch("src.vector_db.postgres_vector_service.psycopg2.connect")
|
| 144 |
+
def test_health_check(self, mock_connect):
|
| 145 |
+
"""Test health check functionality."""
|
| 146 |
+
mock_conn = Mock()
|
| 147 |
+
mock_cursor = Mock()
|
| 148 |
+
mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
|
| 149 |
+
|
| 150 |
+
# Mock health check queries
|
| 151 |
+
mock_cursor.fetchone.side_effect = [
|
| 152 |
+
[1], # SELECT 1
|
| 153 |
+
[True], # pgvector extension check
|
| 154 |
+
[10], # document count
|
| 155 |
+
["500 KB"], # table size
|
| 156 |
+
["embedding", "vector(384)"], # column info
|
| 157 |
+
]
|
| 158 |
+
|
| 159 |
+
mock_connect.return_value = mock_conn
|
| 160 |
+
|
| 161 |
+
service = PostgresVectorService(
|
| 162 |
+
connection_string=self.test_connection_string,
|
| 163 |
+
table_name=self.test_table_name,
|
| 164 |
+
)
|
| 165 |
+
service.dimension = 384
|
| 166 |
+
|
| 167 |
+
health = service.health_check()
|
| 168 |
+
|
| 169 |
+
assert health["status"] == "healthy"
|
| 170 |
+
assert health["pgvector_installed"] is True
|
| 171 |
+
assert health["connection"] == "ok"
|
| 172 |
+
|
| 173 |
+
|
| 174 |
+
class TestPostgresVectorAdapter:
|
| 175 |
+
"""Test PostgresVectorAdapter compatibility with ChromaDB interface."""
|
| 176 |
+
|
| 177 |
+
def setup_method(self):
|
| 178 |
+
"""Setup test fixtures."""
|
| 179 |
+
self.test_table_name = "test_embeddings"
|
| 180 |
+
|
| 181 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 182 |
+
def test_adapter_initialization(self, mock_service_class):
|
| 183 |
+
"""Test adapter initialization."""
|
| 184 |
+
mock_service = Mock()
|
| 185 |
+
mock_service_class.return_value = mock_service
|
| 186 |
+
|
| 187 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 188 |
+
|
| 189 |
+
assert adapter.collection_name == self.test_table_name
|
| 190 |
+
mock_service_class.assert_called_once_with(table_name=self.test_table_name)
|
| 191 |
+
|
| 192 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 193 |
+
def test_add_embeddings_chromadb_compatibility(self, mock_service_class):
|
| 194 |
+
"""Test add_embeddings method compatibility with ChromaDB interface."""
|
| 195 |
+
mock_service = Mock()
|
| 196 |
+
mock_service.add_documents.return_value = ["1", "2"]
|
| 197 |
+
mock_service_class.return_value = mock_service
|
| 198 |
+
|
| 199 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 200 |
+
|
| 201 |
+
embeddings = [[0.1, 0.2], [0.3, 0.4]]
|
| 202 |
+
chunk_ids = ["chunk1", "chunk2"]
|
| 203 |
+
documents = ["doc1", "doc2"]
|
| 204 |
+
metadatas = [{"source": "test1"}, {"source": "test2"}]
|
| 205 |
+
|
| 206 |
+
result = adapter.add_embeddings(embeddings, chunk_ids, documents, metadatas)
|
| 207 |
+
|
| 208 |
+
assert result is True
|
| 209 |
+
mock_service.add_documents.assert_called_once_with(
|
| 210 |
+
documents, embeddings, metadatas
|
| 211 |
+
)
|
| 212 |
+
|
| 213 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 214 |
+
def test_search_chromadb_compatibility(self, mock_service_class):
|
| 215 |
+
"""Test search method compatibility with ChromaDB interface."""
|
| 216 |
+
mock_service = Mock()
|
| 217 |
+
mock_service.similarity_search.return_value = [
|
| 218 |
+
{
|
| 219 |
+
"id": "1",
|
| 220 |
+
"content": "test document",
|
| 221 |
+
"metadata": {"source": "test"},
|
| 222 |
+
"similarity_score": 0.85,
|
| 223 |
+
}
|
| 224 |
+
]
|
| 225 |
+
mock_service_class.return_value = mock_service
|
| 226 |
+
|
| 227 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 228 |
+
|
| 229 |
+
query_embedding = [0.1, 0.2, 0.3]
|
| 230 |
+
results = adapter.search(query_embedding, top_k=5)
|
| 231 |
+
|
| 232 |
+
assert len(results) == 1
|
| 233 |
+
assert results[0]["id"] == "1"
|
| 234 |
+
assert results[0]["document"] == "test document"
|
| 235 |
+
assert results[0]["metadata"] == {"source": "test"}
|
| 236 |
+
assert "distance" in results[0]
|
| 237 |
+
assert results[0]["distance"] == pytest.approx(0.15) # 1.0 - 0.85
|
| 238 |
+
|
| 239 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 240 |
+
def test_get_count_chromadb_compatibility(self, mock_service_class):
|
| 241 |
+
"""Test get_count method compatibility with ChromaDB interface."""
|
| 242 |
+
mock_service = Mock()
|
| 243 |
+
mock_service.get_collection_info.return_value = {"document_count": 42}
|
| 244 |
+
mock_service_class.return_value = mock_service
|
| 245 |
+
|
| 246 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 247 |
+
|
| 248 |
+
count = adapter.get_count()
|
| 249 |
+
|
| 250 |
+
assert count == 42
|
| 251 |
+
|
| 252 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 253 |
+
def test_batch_operations(self, mock_service_class):
|
| 254 |
+
"""Test batch operations compatibility."""
|
| 255 |
+
mock_service = Mock()
|
| 256 |
+
mock_service.add_documents.return_value = ["1", "2"]
|
| 257 |
+
mock_service_class.return_value = mock_service
|
| 258 |
+
|
| 259 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 260 |
+
|
| 261 |
+
batch_embeddings = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
|
| 262 |
+
batch_chunk_ids = [["chunk1", "chunk2"], ["chunk3"]]
|
| 263 |
+
batch_documents = [["doc1", "doc2"], ["doc3"]]
|
| 264 |
+
batch_metadatas = [
|
| 265 |
+
[{"source": "test1"}, {"source": "test2"}],
|
| 266 |
+
[{"source": "test3"}],
|
| 267 |
+
]
|
| 268 |
+
|
| 269 |
+
total_added = adapter.add_embeddings_batch(
|
| 270 |
+
batch_embeddings, batch_chunk_ids, batch_documents, batch_metadatas
|
| 271 |
+
)
|
| 272 |
+
|
| 273 |
+
assert total_added == 3 # 2 + 1
|
| 274 |
+
assert mock_service.add_documents.call_count == 2
|
| 275 |
+
|
| 276 |
+
|
| 277 |
+
class TestVectorDatabaseFactory:
|
| 278 |
+
"""Test the vector database factory function."""
|
| 279 |
+
|
| 280 |
+
@patch.dict(os.environ, {"VECTOR_STORAGE_TYPE": "postgres"})
|
| 281 |
+
@patch("src.vector_store.vector_db.PostgresVectorAdapter")
|
| 282 |
+
def test_factory_creates_postgres_adapter(self, mock_adapter_class):
|
| 283 |
+
"""Test factory creates PostgreSQL adapter when configured."""
|
| 284 |
+
from src.vector_store.vector_db import create_vector_database
|
| 285 |
+
|
| 286 |
+
mock_adapter = Mock()
|
| 287 |
+
mock_adapter_class.return_value = mock_adapter
|
| 288 |
+
|
| 289 |
+
db = create_vector_database(collection_name="test_collection")
|
| 290 |
+
|
| 291 |
+
assert db == mock_adapter
|
| 292 |
+
mock_adapter_class.assert_called_once_with(table_name="test_collection")
|
| 293 |
+
|
| 294 |
+
@patch.dict(os.environ, {"VECTOR_STORAGE_TYPE": "chroma"})
|
| 295 |
+
@patch("src.vector_store.vector_db.VectorDatabase")
|
| 296 |
+
def test_factory_creates_chroma_database(self, mock_vector_db_class):
|
| 297 |
+
"""Test factory creates ChromaDB when configured."""
|
| 298 |
+
from src.vector_store.vector_db import create_vector_database
|
| 299 |
+
|
| 300 |
+
mock_db = Mock()
|
| 301 |
+
mock_vector_db_class.return_value = mock_db
|
| 302 |
+
|
| 303 |
+
db = create_vector_database(
|
| 304 |
+
persist_path="/test/path", collection_name="test_collection"
|
| 305 |
+
)
|
| 306 |
+
|
| 307 |
+
assert db == mock_db
|
| 308 |
+
mock_vector_db_class.assert_called_once_with(
|
| 309 |
+
persist_path="/test/path", collection_name="test_collection"
|
| 310 |
+
)
|
| 311 |
+
|
| 312 |
+
|
| 313 |
+
# Integration tests (require actual database)
|
| 314 |
+
@pytest.mark.integration
|
| 315 |
+
class TestPostgresIntegration:
|
| 316 |
+
"""Integration tests that require a real PostgreSQL database."""
|
| 317 |
+
|
| 318 |
+
@pytest.fixture
|
| 319 |
+
def postgres_service(self):
|
| 320 |
+
"""Create a PostgreSQL service for testing."""
|
| 321 |
+
# Only run if DATABASE_URL is set
|
| 322 |
+
database_url = os.getenv("TEST_DATABASE_URL")
|
| 323 |
+
if not database_url:
|
| 324 |
+
pytest.skip("TEST_DATABASE_URL not set")
|
| 325 |
+
|
| 326 |
+
service = PostgresVectorService(
|
| 327 |
+
connection_string=database_url, table_name="test_embeddings"
|
| 328 |
+
)
|
| 329 |
+
|
| 330 |
+
# Clean up before test
|
| 331 |
+
service.delete_all_documents()
|
| 332 |
+
|
| 333 |
+
yield service
|
| 334 |
+
|
| 335 |
+
# Clean up after test
|
| 336 |
+
try:
|
| 337 |
+
service.delete_all_documents()
|
| 338 |
+
except:
|
| 339 |
+
pass # Ignore cleanup errors
|
| 340 |
+
|
| 341 |
+
def test_full_workflow(self, postgres_service):
|
| 342 |
+
"""Test complete workflow with real database."""
|
| 343 |
+
# Add documents
|
| 344 |
+
texts = ["This is a test document.", "Another test document."]
|
| 345 |
+
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
|
| 346 |
+
metadatas = [{"source": "test1"}, {"source": "test2"}]
|
| 347 |
+
|
| 348 |
+
doc_ids = postgres_service.add_documents(texts, embeddings, metadatas)
|
| 349 |
+
assert len(doc_ids) == 2
|
| 350 |
+
|
| 351 |
+
# Search
|
| 352 |
+
query_embedding = [0.1, 0.2, 0.3, 0.4]
|
| 353 |
+
results = postgres_service.similarity_search(query_embedding, k=2)
|
| 354 |
+
|
| 355 |
+
assert len(results) <= 2
|
| 356 |
+
if results:
|
| 357 |
+
assert "content" in results[0]
|
| 358 |
+
assert "similarity_score" in results[0]
|
| 359 |
+
|
| 360 |
+
# Get info
|
| 361 |
+
info = postgres_service.get_collection_info()
|
| 362 |
+
assert info["document_count"] == 2
|
| 363 |
+
|
| 364 |
+
# Health check
|
| 365 |
+
health = postgres_service.health_check()
|
| 366 |
+
assert health["status"] == "healthy"
|
tests/test_vector_store/test_postgres_vector_simple.py
ADDED
|
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Tests for PostgresVectorService and PostgresVectorAdapter (simplified).
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
import os
|
| 6 |
+
from unittest.mock import Mock, patch
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
|
| 10 |
+
from src.vector_db.postgres_adapter import PostgresVectorAdapter
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class TestPostgresVectorAdapter:
|
| 14 |
+
"""Test PostgresVectorAdapter compatibility."""
|
| 15 |
+
|
| 16 |
+
def setup_method(self):
|
| 17 |
+
"""Setup test fixtures."""
|
| 18 |
+
self.test_table_name = "test_embeddings"
|
| 19 |
+
|
| 20 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 21 |
+
def test_adapter_initialization(self, mock_service_class):
|
| 22 |
+
"""Test adapter initialization."""
|
| 23 |
+
mock_service = Mock()
|
| 24 |
+
mock_service_class.return_value = mock_service
|
| 25 |
+
|
| 26 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 27 |
+
|
| 28 |
+
assert adapter.collection_name == self.test_table_name
|
| 29 |
+
|
| 30 |
+
@patch("src.vector_db.postgres_adapter.PostgresVectorService")
|
| 31 |
+
def test_get_count_chromadb_compatibility(self, mock_service_class):
|
| 32 |
+
"""Test get_count method compatibility with ChromaDB interface."""
|
| 33 |
+
mock_service = Mock()
|
| 34 |
+
mock_service.get_collection_info.return_value = {"document_count": 42}
|
| 35 |
+
mock_service_class.return_value = mock_service
|
| 36 |
+
|
| 37 |
+
adapter = PostgresVectorAdapter(table_name=self.test_table_name)
|
| 38 |
+
|
| 39 |
+
count = adapter.get_count()
|
| 40 |
+
|
| 41 |
+
assert count == 42
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class TestVectorDatabaseFactory:
|
| 45 |
+
"""Test the vector database factory function."""
|
| 46 |
+
|
| 47 |
+
@patch.dict(os.environ, {"VECTOR_STORAGE_TYPE": "chroma"})
|
| 48 |
+
@patch("src.vector_store.vector_db.VectorDatabase")
|
| 49 |
+
def test_factory_creates_chroma_database(self, mock_vector_db_class):
|
| 50 |
+
"""Test factory creates ChromaDB when configured."""
|
| 51 |
+
from src.vector_store.vector_db import create_vector_database
|
| 52 |
+
|
| 53 |
+
mock_db = Mock()
|
| 54 |
+
mock_vector_db_class.return_value = mock_db
|
| 55 |
+
|
| 56 |
+
db = create_vector_database(
|
| 57 |
+
persist_path="/test/path", collection_name="test_collection"
|
| 58 |
+
)
|
| 59 |
+
|
| 60 |
+
assert db == mock_db
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
# Integration tests (require actual database)
|
| 64 |
+
@pytest.mark.integration
|
| 65 |
+
class TestPostgresIntegration:
|
| 66 |
+
"""Integration tests that require a real PostgreSQL database."""
|
| 67 |
+
|
| 68 |
+
def test_skip_integration(self):
|
| 69 |
+
"""Skip integration tests without database."""
|
| 70 |
+
database_url = os.getenv("TEST_DATABASE_URL")
|
| 71 |
+
if not database_url:
|
| 72 |
+
pytest.skip("TEST_DATABASE_URL not set")
|