Seth McKnight committed on
Commit dca679b · 1 Parent(s): 48155ff

Postgres vector migration (#83)


* feat: Implement PostgreSQL with pgvector as ChromaDB alternative

- Add PostgresVectorService with full pgvector integration
- Create PostgresVectorAdapter for ChromaDB compatibility
- Update config to support vector storage type selection
- Add factory pattern for seamless backend switching
- Include migration script with data optimization
- Add comprehensive tests for PostgreSQL implementation
- Update dependencies and environment configuration
- Expected memory reduction: 300-350MB (from 400MB+ to 50-150MB)

This enables deployment on Render's 512MB free tier by using persistent
PostgreSQL storage instead of in-memory ChromaDB.

* Add pgvector init script, update migration docs, and test adjustments

POSTGRES_MIGRATION.md ADDED
@@ -0,0 +1,214 @@
+ # PostgreSQL Migration Guide
+
+ ## Overview
+ This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to roughly 50-150MB by storing vectors on disk instead of in RAM.
+
+ ## What's Been Implemented
+
+ ### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)
+ - Full PostgreSQL integration with pgvector extension
+ - Automatic table creation and indexing
+ - Similarity search using cosine distance
+ - Document CRUD operations
+ - Health monitoring and collection info
+
+ ### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)
+ - Compatibility layer for the existing ChromaDB interface
+ - Enables seamless migration without code changes
+ - Converts between PostgreSQL and ChromaDB result formats
+
+ ### 3. Updated Configuration (`src/config.py`)
+ - Added `VECTOR_STORAGE_TYPE` environment variable
+ - PostgreSQL connection settings
+ - Memory optimization parameters
+
+ ### 4. Factory Pattern (`src/vector_store/vector_db.py`)
+ - `create_vector_database()` function selects the backend automatically
+ - Supports both ChromaDB and PostgreSQL based on configuration (see the sketch below)
+
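+ A minimal sketch of the selection logic, with the persist path hard-coded for illustration (the full implementation appears in the `src/vector_store/vector_db.py` diff below):
+
+ ```python
+ import os
+
+ def create_vector_database(collection_name: str = "policy_documents"):
+     """Pick the vector backend from the environment; defaults to ChromaDB."""
+     if os.getenv("VECTOR_STORAGE_TYPE", "chroma") == "postgres":
+         from src.vector_db.postgres_adapter import PostgresVectorAdapter
+
+         return PostgresVectorAdapter(table_name=collection_name)
+     from src.vector_store.vector_db import VectorDatabase
+
+     return VectorDatabase(persist_path="data/chroma_db", collection_name=collection_name)
+ ```
+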
+ ### 5. Migration Script (`scripts/migrate_to_postgres.py`)
+ - Data optimization (text summarization, metadata cleaning)
+ - Batch processing with memory management
+ - Reduces roughly 4GB of data to about 1GB to fit the free tier
+
+ ### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)
+ - Unit tests with mocked dependencies
+ - Integration tests against a real database
+ - Compatibility tests for the ChromaDB interface
+
+ ## Setup Instructions
+
+ ### Step 1: Create Render PostgreSQL Database
+ 1. Go to the Render Dashboard
+ 2. Create → PostgreSQL
+ 3. Choose the "Free" plan (1GB storage, 30 days)
+ 4. Save the connection details
+
+ ### Step 2: Enable pgvector Extension
+ You have several options to enable pgvector:
+
+ **Option A: Use the initialization script (recommended)**
+ ```bash
+ # Set your database URL
+ export DATABASE_URL="postgresql://user:password@host:port/database"
+
+ # Run the initialization script
+ python scripts/init_pgvector.py
+ ```
+
+ **Option B: Manual SQL**
+ Connect to your database and run:
+ ```sql
+ CREATE EXTENSION IF NOT EXISTS vector;
+ ```
+
+ **Option C: From the Render Dashboard**
+ 1. Go to your PostgreSQL service → Info tab
+ 2. Use the "PSQL Command" to connect
+ 3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`
+
+ The initialization script (`scripts/init_pgvector.py`) will:
+ - Test the database connection
+ - Check PostgreSQL version compatibility (13+)
+ - Install the pgvector extension safely
+ - Verify that vector operations work correctly
+ - Provide detailed logging and error messages
+
+ ### Step 3: Update Environment Variables
+ Add these to your Render environment variables:
+ ```bash
+ DATABASE_URL=postgresql://username:password@host:port/database
+ VECTOR_STORAGE_TYPE=postgres
+ MEMORY_LIMIT_MB=400
+ ```
+
+ ### Step 4: Install Dependencies
+ ```bash
+ pip install psycopg2-binary==2.9.7
+ ```
+
+ ### Step 5: Run Migration (Optional)
+ If you have existing ChromaDB data:
+ ```bash
+ python scripts/migrate_to_postgres.py --database-url="your-connection-string"
+ ```
+
+ ## Usage
+
+ ### Switch to PostgreSQL
+ Set the environment variable:
+ ```bash
+ export VECTOR_STORAGE_TYPE=postgres
+ ```
+
+ ### Use in Code (No Changes Required!)
+ ```python
+ from src.vector_store.vector_db import create_vector_database
+
+ # Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
+ vector_db = create_vector_database()
+ vector_db.add_embeddings(embeddings, ids, documents, metadatas)
+ results = vector_db.search(query_embedding, top_k=5)
+ ```
+
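+ Under the PostgreSQL backend, each entry returned by `search()` keeps the ChromaDB-style shape; the values below are illustrative:
+
+ ```python
+ # One entry of the list returned by vector_db.search(...)
+ {
+     "id": "42",                     # PostgreSQL row id, returned as a string
+     "document": "Employees may request remote work...",
+     "metadata": {"source": "remote_work_policy.md"},
+     "distance": 0.13,               # converted from 1 - similarity_score
+ }
+ ```
+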
+ ## Expected Memory Reduction
+
+ | Component | Before (ChromaDB) | After (PostgreSQL) | Savings |
+ |-----------|------------------|-------------------|---------|
+ | Vector Storage | 200-300MB | 0MB (on disk) | 200-300MB |
+ | Embedding Model | 100MB | 50MB (smaller model) | 50MB |
+ | Application Code | 50-100MB | 50-100MB | 0MB |
+ | **Total** | **350-500MB** | **50-150MB** | **300-350MB** |
+
+ ## Migration Optimizations
+
+ ### Data Size Reduction
+ - **Text Summarization**: Documents truncated to 1000 characters
+ - **Metadata Cleaning**: Only essential fields kept
+ - **Dimension Reduction**: Can use smaller embedding models
+ - **Quality Filtering**: Skip very short or low-quality documents
+
+ ### Memory Management
+ - **Batch Processing**: Process documents in small batches (see the sketch below)
+ - **Garbage Collection**: Aggressive cleanup between operations
+ - **Streaming**: Process data without loading everything into memory
+
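+ A minimal sketch of the batching pattern; `process_batch` is a placeholder for the real per-batch work in `scripts/migrate_to_postgres.py`:
+
+ ```python
+ import gc
+
+ def migrate_in_batches(documents, process_batch, batch_size=100):
+     """Handle documents in fixed-size batches, releasing memory between them."""
+     for start in range(0, len(documents), batch_size):
+         process_batch(documents[start:start + batch_size])
+         gc.collect()  # aggressively reclaim memory between batches
+ ```
+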
+ ## Testing
+
+ ### Unit Tests
+ ```bash
+ pytest tests/test_vector_store/test_postgres_vector.py -v
+ ```
+
+ ### Integration Tests (Requires a Database)
+ ```bash
+ export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
+ pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
+ ```
+
+ ### Migration Test
+ ```bash
+ python scripts/migrate_to_postgres.py --test-only
+ ```
+
+ ## Deployment
+
+ ### Local Development
+ Keep using ChromaDB:
+ ```bash
+ export VECTOR_STORAGE_TYPE=chroma
+ ```
+
+ ### Production (Render)
+ Switch to PostgreSQL:
+ ```bash
+ export VECTOR_STORAGE_TYPE=postgres
+ export DATABASE_URL="your-render-postgres-url"
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+ 1. **"pgvector extension not found"**
+    - Run `CREATE EXTENSION vector;` in your database
+
+ 2. **Connection errors**
+    - Verify the DATABASE_URL format: `postgresql://user:pass@host:port/db` (see the quick check below)
+    - Check firewall/network connectivity
+
+ 3. **Memory still high**
+    - Verify `VECTOR_STORAGE_TYPE=postgres`
+    - Check that old ChromaDB files aren't being loaded
+
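+ For connection errors, a quick check with `psycopg2` (assuming `psycopg2-binary` is installed) fails fast with a readable error if the URL, network, or credentials are wrong:
+
+ ```python
+ import os
+ import psycopg2
+
+ # Connects using the same DATABASE_URL the service reads.
+ conn = psycopg2.connect(os.environ["DATABASE_URL"])
+ with conn.cursor() as cur:
+     cur.execute("SELECT 1;")
+     print("connection ok:", cur.fetchone())
+ conn.close()
+ ```
+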
+ ### Monitoring
+ ```python
+ from src.vector_db.postgres_vector_service import PostgresVectorService
+
+ service = PostgresVectorService()
+ health = service.health_check()
+ print(health)  # Shows connection status, document count, etc.
+ ```
+
+ ## Rollback Plan
+ If issues occur, simply switch back to ChromaDB:
+ ```bash
+ export VECTOR_STORAGE_TYPE=chroma
+ ```
+ The factory pattern ensures seamless switching between backends.
+
+ ## Performance Comparison
+
+ | Operation | ChromaDB | PostgreSQL | Notes |
+ |-----------|----------|------------|-------|
+ | Insert | Fast | Medium | Network overhead |
+ | Search | Very fast | Fast | pgvector is optimized |
+ | Memory | High | Low | Vectors stored on disk |
+ | Persistence | File-based | Database | More reliable |
+ | Scaling | Limited | Excellent | Storage can be upgraded |
+
+ ## Next Steps
+ 1. Test locally with PostgreSQL
+ 2. Create the Render PostgreSQL database
+ 3. Run the migration script
+ 4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
+ 5. Monitor memory usage in production
dev-requirements.txt CHANGED
@@ -3,3 +3,4 @@ black>=25.0.0
  isort==5.13.0
  flake8==6.1.0
  psutil
+ psycopg2-binary==2.9.7
requirements.txt CHANGED
@@ -5,6 +5,7 @@ gunicorn==22.0.0
  # Vector database and embeddings
  chromadb==0.4.24
  sentence-transformers==2.7.0
+ psycopg2-binary==2.9.7
 
  # Core dependencies (pinned for reproducibility, Python 3.12 compatible)
  numpy==1.26.4
scripts/init_pgvector.py ADDED
@@ -0,0 +1,208 @@
+ #!/usr/bin/env python3
+ """
+ Initialize pgvector extension in PostgreSQL database.
+
+ This script connects to the database specified by the DATABASE_URL environment
+ variable and enables the pgvector extension if it is not already installed.
+
+ Usage:
+     python scripts/init_pgvector.py
+
+ Environment Variables:
+     DATABASE_URL: PostgreSQL connection string (required)
+
+ Exit Codes:
+     0: Success - pgvector extension is installed and working
+     1: Error - connection failed, extension installation failed, or other error
+ """
+
+ import logging
+ import os
+ import sys
+
+ import psycopg2  # type: ignore
+ import psycopg2.extras  # type: ignore
+
+
+ def setup_logging() -> logging.Logger:
+     """Set up logging configuration."""
+     logging.basicConfig(
+         level=logging.INFO,
+         format="%(asctime)s - %(levelname)s - %(message)s",
+         datefmt="%Y-%m-%d %H:%M:%S",
+     )
+     return logging.getLogger(__name__)
+
+
+ def get_database_url() -> str:
+     """Get DATABASE_URL from the environment."""
+     database_url = os.getenv("DATABASE_URL")
+     if not database_url:
+         raise ValueError("DATABASE_URL environment variable is required")
+     return database_url
+
+
+ def test_connection(connection_string: str, logger: logging.Logger) -> bool:
+     """Test the database connection."""
+     try:
+         with psycopg2.connect(connection_string) as conn:
+             with conn.cursor() as cur:
+                 cur.execute("SELECT 1;")
+                 result = cur.fetchone()
+                 if result and result[0] == 1:
+                     logger.info("✅ Database connection successful")
+                     return True
+                 else:
+                     logger.error("❌ Unexpected result from connection test")
+                     return False
+     except Exception as e:
+         logger.error(f"❌ Database connection failed: {e}")
+         return False
+
+
+ def check_postgresql_version(connection_string: str, logger: logging.Logger) -> bool:
+     """Check if the PostgreSQL version supports pgvector (13+)."""
+     try:
+         with psycopg2.connect(connection_string) as conn:
+             with conn.cursor() as cur:
+                 cur.execute("SELECT version();")
+                 result = cur.fetchone()
+                 if not result:
+                     logger.error("❌ Could not get PostgreSQL version")
+                     return False
+
+                 version_string = str(result[0])
+
+                 # Extract the major version number
+                 # Format: "PostgreSQL 15.4 on x86_64-pc-linux-gnu..."
+                 version_parts = version_string.split()
+                 if len(version_parts) >= 2:
+                     version_number = version_parts[1].split(".")[0]
+                     major_version = int(version_number)
+
+                     if major_version >= 13:
+                         logger.info(
+                             f"✅ PostgreSQL version {major_version} supports pgvector"
+                         )
+                         return True
+                     else:
+                         logger.error(
+                             "❌ PostgreSQL version %s is too old (requires 13+)",
+                             major_version,
+                         )
+                         return False
+                 else:
+                     logger.warning(
+                         f"⚠️ Could not parse PostgreSQL version: {version_string}"
+                     )
+                     return True  # Proceed anyway
+
+     except Exception as e:
+         logger.error(f"❌ Failed to check PostgreSQL version: {e}")
+         return False
+
+
+ def install_pgvector_extension(connection_string: str, logger: logging.Logger) -> bool:
+     """Install the pgvector extension."""
+     try:
+         with psycopg2.connect(connection_string) as conn:
+             conn.autocommit = True  # Required for CREATE EXTENSION
+             with conn.cursor() as cur:
+                 logger.info("Installing pgvector extension...")
+                 cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
+                 logger.info("✅ pgvector extension installed successfully")
+                 return True
+
+     except psycopg2.errors.InsufficientPrivilege as e:
+         logger.error("❌ Insufficient privileges to install extension: %s", str(e))
+         logger.error(
+             "Make sure your database user has CREATE privilege or is a superuser"
+         )
+         return False
+     except Exception as e:
+         logger.error(f"❌ Failed to install pgvector extension: {e}")
+         return False
+
+
+ def verify_pgvector_installation(
+     connection_string: str, logger: logging.Logger
+ ) -> bool:
+     """Verify that the pgvector extension is properly installed."""
+     try:
+         with psycopg2.connect(connection_string) as conn:
+             with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cur:
+                 # Check that the extension is installed
+                 cur.execute(
+                     "SELECT extname, extversion FROM pg_extension "
+                     "WHERE extname = 'vector';"
+                 )
+                 result = cur.fetchone()
+
+                 if not result:
+                     logger.error("❌ pgvector extension not found in pg_extension")
+                     return False
+
+                 logger.info(f"✅ pgvector extension version: {result['extversion']}")
+
+                 # Test basic vector functionality
+                 cur.execute("SELECT '[1,2,3]'::vector(3);")
+                 vector_result = cur.fetchone()
+                 if vector_result:
+                     logger.info("✅ Vector type functioning correctly")
+                 else:
+                     logger.error("❌ Vector type test failed")
+                     return False
+
+                 # Test vector operations
+                 cur.execute("SELECT '[1,2,3]'::vector(3) <-> '[1,2,4]'::vector(3);")
+                 distance_result = cur.fetchone()
+                 if distance_result and distance_result[0] == 1.0:
+                     logger.info("✅ Vector distance operations working")
+                     return True
+                 else:
+                     logger.error("❌ Vector distance operations failed")
+                     return False
+
+     except Exception as e:
+         logger.error(f"❌ Failed to verify pgvector installation: {e}")
+         return False
+
+
+ def main() -> int:
+     """Main function."""
+     logger = setup_logging()
+
+     try:
+         logger.info("🚀 Starting pgvector initialization...")
+
+         # Get database connection string
+         database_url = get_database_url()
+         logger.info("📡 Got DATABASE_URL from environment")
+
+         # Test connection
+         if not test_connection(database_url, logger):
+             return 1
+
+         # Check PostgreSQL version
+         if not check_postgresql_version(database_url, logger):
+             return 1
+
+         # Install pgvector extension
+         if not install_pgvector_extension(database_url, logger):
+             return 1
+
+         # Verify installation
+         if not verify_pgvector_installation(database_url, logger):
+             return 1
+
+         logger.info("🎉 pgvector initialization completed successfully!")
+         logger.info("   Your PostgreSQL database is now ready for vector operations.")
+         return 0
+
+     except Exception as e:
+         logger.error(f"❌ Unexpected error: {e}")
+         return 1
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
scripts/migrate_to_postgres.py ADDED
@@ -0,0 +1,434 @@
+ #!/usr/bin/env python3
+ """
+ Migration script to move data from ChromaDB to PostgreSQL with data optimization.
+ This script reduces data size to fit within Render's 1GB PostgreSQL free tier limit.
+ """
+
+ import gc
+ import logging
+ import os
+ import re
+ import sys
+ from typing import Any, Dict, Iterator, Optional
+
+ # Add the repository root to the path so `src.*` imports resolve
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+ from src.config import (
+     COLLECTION_NAME,
+     MAX_DOCUMENT_LENGTH,
+     MAX_DOCUMENTS_IN_MEMORY,
+     VECTOR_DB_PERSIST_PATH,
+ )
+ from src.embedding.embedding_service import EmbeddingService
+ from src.vector_db.postgres_vector_service import PostgresVectorService
+ from src.vector_store.vector_db import VectorDatabase
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+ )
+ logger = logging.getLogger(__name__)
+
+
+ class DataOptimizer:
+     """Optimizes document data to reduce storage requirements."""
+
+     @staticmethod
+     def summarize_text(text: str, max_length: int = MAX_DOCUMENT_LENGTH) -> str:
+         """
+         Summarize text to reduce storage while preserving key information.
+
+         Args:
+             text: Original text
+             max_length: Maximum length for summarized text
+
+         Returns:
+             Summarized text
+         """
+         if len(text) <= max_length:
+             return text.strip()
+
+         # Simple extractive summarization: keep the first few sentences
+         sentences = re.split(r"[.!?]+", text)
+         summary = ""
+
+         for sentence in sentences:
+             sentence = sentence.strip()
+             if not sentence:
+                 continue
+
+             # Check if adding this sentence would exceed the limit
+             if len(summary + sentence + ".") > max_length:
+                 break
+
+             summary += sentence + ". "
+
+         # If the summary is too short, take the first max_length characters
+         if len(summary) < max_length // 4:
+             summary = text[:max_length].strip()
+
+         return summary.strip()
+
+     @staticmethod
+     def clean_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
+         """
+         Clean metadata to keep only essential fields.
+
+         Args:
+             metadata: Original metadata
+
+         Returns:
+             Cleaned metadata with only essential fields
+         """
+         essential_fields = {
+             "source",
+             "title",
+             "page",
+             "chunk_id",
+             "document_type",
+             "created_at",
+             "file_path",
+             "section",
+         }
+
+         cleaned = {}
+         for key, value in metadata.items():
+             if key in essential_fields and value is not None:
+                 # Convert to simple types and truncate long strings
+                 if isinstance(value, str) and len(value) > 100:
+                     cleaned[key] = value[:100]
+                 elif isinstance(value, (str, int, float, bool)):
+                     cleaned[key] = value
+
+         return cleaned
+
+     @staticmethod
+     def should_include_document(metadata: Dict[str, Any], content: str) -> bool:
+         """
+         Decide whether to include a document based on quality metrics.
+
+         Args:
+             metadata: Document metadata
+             content: Document content
+
+         Returns:
+             True if the document should be included
+         """
+         # Skip very short documents (likely not useful)
+         if len(content.strip()) < 50:
+             return False
+
+         # Skip documents with no meaningful content
+         if not re.search(r"[a-zA-Z]{3,}", content):
+             return False
+
+         # Prioritize certain document types if available
+         doc_type = metadata.get("document_type", "").lower()
+         if doc_type in ["policy", "procedure", "guideline"]:
+             return True
+
+         return True
+
+
+ class ChromaToPostgresMigrator:
+     """Migrates data from ChromaDB to PostgreSQL with optimization."""
+
+     def __init__(self, database_url: Optional[str] = None):
+         """
+         Initialize the migrator.
+
+         Args:
+             database_url: PostgreSQL connection string
+         """
+         self.database_url = database_url or os.getenv("DATABASE_URL")
+         if not self.database_url:
+             raise ValueError("DATABASE_URL environment variable is required")
+
+         self.optimizer = DataOptimizer()
+         self.embedding_service = None
+         self.total_migrated = 0
+         self.total_skipped = 0
+
+     def initialize_services(self):
+         """Initialize the embedding service and database connections."""
+         logger.info("Initializing services...")
+
+         # Initialize embedding service
+         self.embedding_service = EmbeddingService()
+
+         # Initialize ChromaDB (source)
+         self.chroma_db = VectorDatabase(
+             persist_path=VECTOR_DB_PERSIST_PATH, collection_name=COLLECTION_NAME
+         )
+
+         # Initialize PostgreSQL (destination)
+         self.postgres_service = PostgresVectorService(
+             connection_string=self.database_url, table_name=COLLECTION_NAME
+         )
+
+         logger.info("Services initialized successfully")
+
+     def get_chroma_documents(
+         self, batch_size: int = MAX_DOCUMENTS_IN_MEMORY
+     ) -> Iterator[Dict[str, Any]]:
+         """
+         Retrieve all documents from ChromaDB in batches.
+
+         Args:
+             batch_size: Number of documents to retrieve per batch
+
+         Yields:
+             Batches of documents
+         """
+         try:
+             total_count = self.chroma_db.get_count()
+             logger.info(f"Found {total_count} documents in ChromaDB")
+
+             if total_count == 0:
+                 return
+
+             # Get all documents (ChromaDB doesn't have native pagination)
+             collection = self.chroma_db.get_collection()
+             all_data = collection.get(include=["documents", "metadatas", "embeddings"])
+
+             if not all_data or not all_data.get("documents"):
+                 logger.warning("No documents found in ChromaDB collection")
+                 return
+
+             # Process in batches
+             documents = all_data["documents"]
+             metadatas = all_data.get("metadatas", [{}] * len(documents))
+             embeddings = all_data.get("embeddings", [])
+             ids = all_data.get("ids", [])
+
+             for i in range(0, len(documents), batch_size):
+                 batch_end = min(i + batch_size, len(documents))
+
+                 batch_docs = documents[i:batch_end]
+                 batch_metadata = (
+                     metadatas[i:batch_end] if metadatas else [{}] * len(batch_docs)
+                 )
+                 batch_embeddings = embeddings[i:batch_end] if embeddings else []
+                 batch_ids = ids[i:batch_end] if ids else []
+
+                 yield {
+                     "documents": batch_docs,
+                     "metadatas": batch_metadata,
+                     "embeddings": batch_embeddings,
+                     "ids": batch_ids,
+                 }
+
+         except Exception as e:
+             logger.error(f"Error retrieving ChromaDB documents: {e}")
+             raise
+
+     def process_batch(self, batch: Dict[str, Any]) -> Dict[str, int]:
+         """
+         Process a batch of documents with optimization.
+
+         Args:
+             batch: Batch of documents from ChromaDB
+
+         Returns:
+             Dictionary with processing statistics
+         """
+         documents = batch["documents"]
+         metadatas = batch["metadatas"]
+         embeddings = batch["embeddings"]
+
+         processed_docs = []
+         processed_metadata = []
+         processed_embeddings = []
+
+         stats = {"processed": 0, "skipped": 0, "reembedded": 0}
+
+         for i, (doc, metadata) in enumerate(zip(documents, metadatas)):
+             # Clean and optimize the document
+             cleaned_metadata = self.optimizer.clean_metadata(metadata or {})
+
+             # Check if we should include this document
+             if not self.optimizer.should_include_document(cleaned_metadata, doc):
+                 stats["skipped"] += 1
+                 continue
+
+             # Summarize the document content
+             summarized_doc = self.optimizer.summarize_text(doc)
+
+             # Use the existing embedding if available and the document is unchanged
+             if embeddings and i < len(embeddings) and len(doc) == len(summarized_doc):
+                 # Document unchanged, use the existing embedding
+                 embedding = embeddings[i]
+             else:
+                 # Document changed, a new embedding is needed
+                 try:
+                     embedding = self.embedding_service.generate_embeddings(
+                         [summarized_doc]
+                     )[0]
+                     stats["reembedded"] += 1
+                 except Exception as e:
+                     logger.warning(
+                         f"Failed to generate embedding for document {i}: {e}"
+                     )
+                     stats["skipped"] += 1
+                     continue
+
+             processed_docs.append(summarized_doc)
+             processed_metadata.append(cleaned_metadata)
+             processed_embeddings.append(embedding)
+             stats["processed"] += 1
+
+         # Add processed documents to PostgreSQL
+         if processed_docs:
+             try:
+                 doc_ids = self.postgres_service.add_documents(
+                     texts=processed_docs,
+                     embeddings=processed_embeddings,
+                     metadatas=processed_metadata,
+                 )
+                 logger.info(f"Added {len(doc_ids)} documents to PostgreSQL")
+             except Exception as e:
+                 logger.error(f"Failed to add documents to PostgreSQL: {e}")
+                 raise
+
+         # Force garbage collection
+         gc.collect()
+
+         return stats
+
+     def migrate(self) -> Dict[str, int]:
+         """
+         Perform the complete migration.
+
+         Returns:
+             Migration statistics
+         """
+         logger.info("Starting ChromaDB to PostgreSQL migration...")
+
+         self.initialize_services()
+
+         # Clear existing PostgreSQL data
+         logger.info("Clearing existing PostgreSQL data...")
+         deleted_count = self.postgres_service.delete_all_documents()
+         logger.info(f"Deleted {deleted_count} existing documents from PostgreSQL")
+
+         total_stats = {"processed": 0, "skipped": 0, "reembedded": 0}
+         batch_count = 0
+
+         try:
+             # Process documents in batches
+             for batch in self.get_chroma_documents():
+                 batch_count += 1
+                 logger.info(f"Processing batch {batch_count}...")
+
+                 batch_stats = self.process_batch(batch)
+
+                 # Update totals
+                 for key in total_stats:
+                     total_stats[key] += batch_stats[key]
+
+                 logger.info(f"Batch {batch_count} complete: {batch_stats}")
+
+                 # Memory cleanup between batches
+                 gc.collect()
+
+             # Final statistics
+             logger.info("Migration completed successfully!")
+             logger.info(f"Final statistics: {total_stats}")
+
+             # Verify migration
+             postgres_info = self.postgres_service.get_collection_info()
+             logger.info(f"PostgreSQL collection info: {postgres_info}")
+
+             return total_stats
+
+         except Exception as e:
+             logger.error(f"Migration failed: {e}")
+             raise
+
+     def test_migration(self, test_query: str = "policy") -> Dict[str, Any]:
+         """
+         Test the migrated data by performing a search.
+
+         Args:
+             test_query: Query to test with
+
+         Returns:
+             Test results
+         """
+         logger.info(f"Testing migration with query: '{test_query}'")
+
+         try:
+             # Generate the query embedding
+             query_embedding = self.embedding_service.generate_embeddings([test_query])[
+                 0
+             ]
+
+             # Search PostgreSQL
+             results = self.postgres_service.similarity_search(query_embedding, k=5)
+
+             logger.info(f"Test search returned {len(results)} results")
+             for i, result in enumerate(results):
+                 logger.info(
+                     f"Result {i+1}: {result['content'][:100]}... "
+                     f"(score: {result.get('similarity_score', 0):.3f})"
+                 )
+
+             return {
+                 "query": test_query,
+                 "results_count": len(results),
+                 "results": results,
+             }
+
+         except Exception as e:
+             logger.error(f"Migration test failed: {e}")
+             return {"error": str(e)}
+
+
+ def main():
+     """Main migration function."""
+     import argparse
+
+     parser = argparse.ArgumentParser(description="Migrate ChromaDB to PostgreSQL")
+     parser.add_argument("--database-url", help="PostgreSQL connection URL")
+     parser.add_argument(
+         "--test-only", action="store_true", help="Only run the migration test"
+     )
+     parser.add_argument(
+         "--dry-run",
+         action="store_true",
+         help="Show what would be migrated without actually migrating",
+     )
+
+     args = parser.parse_args()
+
+     try:
+         migrator = ChromaToPostgresMigrator(database_url=args.database_url)
+
+         if args.test_only:
+             # Only test an existing migration
+             migrator.initialize_services()
+             results = migrator.test_migration()
+             print(f"Test results: {results}")
+         elif args.dry_run:
+             # Show what would be migrated
+             migrator.initialize_services()
+             total_docs = migrator.chroma_db.get_count()
+             logger.info(
+                 f"Would migrate {total_docs} documents from ChromaDB to PostgreSQL"
+             )
+         else:
+             # Perform the actual migration
+             stats = migrator.migrate()
+             logger.info(f"Migration complete: {stats}")
+
+             # Test the migration
+             test_results = migrator.test_migration()
+             logger.info(f"Migration test: {test_results}")
+
+     except Exception as e:
+         logger.error(f"Migration script failed: {e}")
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
src/config.py CHANGED
@@ -1,5 +1,7 @@
  """Configuration settings for the ingestion pipeline"""
 
+ import os
+
  # Default ingestion settings
  DEFAULT_CHUNK_SIZE = 1000
  DEFAULT_OVERLAP = 200
@@ -12,25 +14,42 @@ SUPPORTED_FORMATS = {".txt", ".md", ".markdown"}
  CORPUS_DIRECTORY = "synthetic_policies"
 
  # Vector Database Settings
- VECTOR_DB_PERSIST_PATH = "data/chroma_db"
+ VECTOR_STORAGE_TYPE = os.getenv(
+     "VECTOR_STORAGE_TYPE", "chroma"
+ )  # "chroma" or "postgres"
+ VECTOR_DB_PERSIST_PATH = "data/chroma_db"  # Used for ChromaDB
+ DATABASE_URL = os.getenv("DATABASE_URL")  # Used for PostgreSQL
  COLLECTION_NAME = "policy_documents"
  EMBEDDING_DIMENSION = 384  # paraphrase-MiniLM-L3-v2 (smaller, memory-efficient)
  SIMILARITY_METRIC = "cosine"
 
- # ChromaDB Configuration for Memory Optimization
+ # ChromaDB Configuration for Memory Optimization (when using ChromaDB)
  CHROMA_SETTINGS = {
      "anonymized_telemetry": False,
      "allow_reset": False,
  }
 
+ # PostgreSQL Configuration (when using PostgreSQL)
+ POSTGRES_TABLE_NAME = "document_embeddings"
+ POSTGRES_MAX_CONNECTIONS = 10
+
  # Embedding Model Settings
- # Embedding Model Settings
- EMBEDDING_MODEL_NAME = (
-     "all-MiniLM-L12-v2"  # Ultra-lightweight model (384 dim, minimal memory)
- )
+ EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # Ultra-lightweight
  EMBEDDING_BATCH_SIZE = 1  # Absolute minimum for extreme memory constraints
  EMBEDDING_DEVICE = "cpu"  # Use CPU for free tier compatibility
 
+ # Document Processing Settings (for memory optimization)
+ MAX_DOCUMENT_LENGTH = 1000  # Truncate documents to reduce memory usage
+ MAX_DOCUMENTS_IN_MEMORY = 100  # Process documents in small batches
+
+ # Memory Management Settings
+ ENABLE_MEMORY_MONITORING = (
+     os.getenv("ENABLE_MEMORY_MONITORING", "true").lower() == "true"
+ )
+ MEMORY_LIMIT_MB = int(
+     os.getenv("MEMORY_LIMIT_MB", "400")
+ )  # Conservative limit for 512MB instances
+
  # Search Settings
  DEFAULT_TOP_K = 5
  MAX_TOP_K = 20
src/vector_db/postgres_adapter.py ADDED
@@ -0,0 +1,126 @@
+ """
+ Adapter to make PostgresVectorService compatible with the existing VectorDatabase interface.
+ """
+
+ import logging
+ from typing import Any, Dict, List
+
+ from src.vector_db.postgres_vector_service import PostgresVectorService
+
+ logger = logging.getLogger(__name__)
+
+
+ class PostgresVectorAdapter:
+     """Adapter to make PostgresVectorService compatible with the VectorDatabase interface."""
+
+     def __init__(self, table_name: str = "document_embeddings"):
+         """Initialize the PostgreSQL vector adapter."""
+         self.service = PostgresVectorService(table_name=table_name)
+         self.collection_name = table_name
+
+     def add_embeddings_batch(
+         self,
+         batch_embeddings: List[List[List[float]]],
+         batch_chunk_ids: List[List[str]],
+         batch_documents: List[List[str]],
+         batch_metadatas: List[List[Dict[str, Any]]],
+     ) -> int:
+         """Add embeddings in batches - compatible with the ChromaDB interface."""
+         total_added = 0
+
+         for embeddings, chunk_ids, documents, metadatas in zip(
+             batch_embeddings, batch_chunk_ids, batch_documents, batch_metadatas
+         ):
+             added = self.add_embeddings(embeddings, chunk_ids, documents, metadatas)
+             # add_embeddings returns a bool; count the whole batch on success
+             if isinstance(added, bool) and added:
+                 total_added += len(embeddings)
+             elif isinstance(added, int):
+                 total_added += added
+
+         return total_added
+
+     def add_embeddings(
+         self,
+         embeddings: List[List[float]],
+         chunk_ids: List[str],
+         documents: List[str],
+         metadatas: List[Dict[str, Any]],
+     ) -> bool:
+         """Add embeddings to PostgreSQL - compatible with the ChromaDB interface.
+
+         Note: chunk_ids are accepted for interface compatibility; PostgreSQL
+         assigns its own serial IDs.
+         """
+         try:
+             doc_ids = self.service.add_documents(documents, embeddings, metadatas)
+             return len(doc_ids) == len(embeddings)
+         except Exception as e:
+             logger.error(f"Failed to add embeddings: {e}")
+             raise
+
+     def search(
+         self, query_embedding: List[float], top_k: int = 5
+     ) -> List[Dict[str, Any]]:
+         """Search for similar embeddings - compatible with the ChromaDB interface."""
+         try:
+             results = self.service.similarity_search(query_embedding, k=top_k)
+
+             # Convert PostgreSQL results to a ChromaDB-compatible format
+             formatted_results = []
+             for result in results:
+                 formatted_result = {
+                     "id": result["id"],
+                     "document": result["content"],
+                     "metadata": result["metadata"],
+                     # Convert similarity back to a distance
+                     "distance": 1.0 - result.get("similarity_score", 0.0),
+                 }
+                 formatted_results.append(formatted_result)
+
+             return formatted_results
+
+         except Exception as e:
+             logger.error(f"Search failed: {e}")
+             return []
+
+     def get_count(self) -> int:
+         """Get the number of embeddings in the collection."""
+         try:
+             info = self.service.get_collection_info()
+             return info.get("document_count", 0)
+         except Exception as e:
+             logger.error(f"Failed to get count: {e}")
+             return 0
+
+     def delete_collection(self) -> bool:
+         """Delete all documents from the collection."""
+         try:
+             deleted_count = self.service.delete_all_documents()
+             return deleted_count >= 0
+         except Exception as e:
+             logger.error(f"Failed to delete collection: {e}")
+             return False
+
+     def reset_collection(self) -> bool:
+         """Reset the collection (delete all documents)."""
+         return self.delete_collection()
+
+     def get_collection(self):
+         """Get the underlying service (for compatibility)."""
+         return self.service
+
+     def get_embedding_dimension(self) -> int:
+         """Get the embedding dimension."""
+         try:
+             info = self.service.get_collection_info()
+             return info.get("embedding_dimension", 0) or 0
+         except Exception as e:
+             logger.error(f"Failed to get embedding dimension: {e}")
+             return 0
+
+     def has_valid_embeddings(self, expected_dimension: int) -> bool:
+         """Check if the collection has embeddings with the expected dimension."""
+         try:
+             actual_dimension = self.get_embedding_dimension()
+             return actual_dimension == expected_dimension and actual_dimension > 0
+         except Exception as e:
+             logger.error(f"Failed to validate embeddings: {e}")
+             return False
src/vector_db/postgres_vector_service.py ADDED
@@ -0,0 +1,473 @@
+ """
+ PostgreSQL vector database service using the pgvector extension.
+ This service provides persistent vector storage with efficient similarity search.
+ """
+
+ import logging
+ import os
+ from contextlib import contextmanager
+ from typing import Any, Dict, List, Optional
+
+ import psycopg2
+ import psycopg2.extras
+
+ logger = logging.getLogger(__name__)
+
+
+ class PostgresVectorService:
+     """Vector database service using PostgreSQL with the pgvector extension."""
+
+     def __init__(
+         self,
+         connection_string: Optional[str] = None,
+         table_name: str = "document_embeddings",
+     ):
+         """
+         Initialize the PostgreSQL vector service.
+
+         Args:
+             connection_string: PostgreSQL connection string. If None, uses the DATABASE_URL env var.
+             table_name: Name of the table to store embeddings.
+         """
+         self.connection_string = connection_string or os.getenv("DATABASE_URL")
+         if not self.connection_string:
+             raise ValueError("DATABASE_URL environment variable is required")
+
+         self.table_name = table_name
+         self.dimension = None  # Will be set based on the first embedding
+
+         # Test the connection and create the table
+         self._initialize_database()
+
+     @contextmanager
+     def _get_connection(self):
+         """Context manager for database connections."""
+         conn = None
+         try:
+             conn = psycopg2.connect(self.connection_string)
+             yield conn
+         except Exception as e:
+             if conn:
+                 conn.rollback()
+             logger.error(f"Database connection error: {e}")
+             raise
+         finally:
+             if conn:
+                 conn.close()
+
+     def _initialize_database(self):
+         """Initialize the database with the required extensions and tables."""
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 # Enable the pgvector extension
+                 cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
+
+                 # Create the table with the initial structure (dimension is set later)
+                 cur.execute(
+                     f"""
+                     CREATE TABLE IF NOT EXISTS {self.table_name} (
+                         id SERIAL PRIMARY KEY,
+                         content TEXT NOT NULL,
+                         embedding vector,
+                         metadata JSONB DEFAULT '{{}}',
+                         created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+                         updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+                     );
+                     """
+                 )
+
+                 # Create an index for text search
+                 cur.execute(
+                     f"""
+                     CREATE INDEX IF NOT EXISTS idx_{self.table_name}_content
+                     ON {self.table_name} USING gin(to_tsvector('english', content));
+                     """
+                 )
+
+                 conn.commit()
+                 logger.info(f"Database initialized with table: {self.table_name}")
+
+     def _ensure_embedding_dimension(self, dimension: int):
+         """Ensure the embedding column has the correct dimension."""
+         if self.dimension == dimension:
+             return
+
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 # Check whether the table needs to be altered
+                 cur.execute(
+                     """
+                     SELECT column_name, data_type, character_maximum_length
+                     FROM information_schema.columns
+                     WHERE table_name = %s AND column_name = 'embedding';
+                     """,
+                     (self.table_name,),
+                 )
+
+                 result = cur.fetchone()
+                 if result and f"vector({dimension})" not in str(result):
+                     # Drop the existing index if it exists
+                     cur.execute(
+                         f"DROP INDEX IF EXISTS idx_{self.table_name}_embedding_cosine;"
+                     )
+
+                     # Alter the column to the correct dimension
+                     cur.execute(
+                         f"ALTER TABLE {self.table_name} ALTER COLUMN embedding TYPE vector({dimension});"
+                     )
+
+                     # Create an optimized index for similarity search
+                     cur.execute(
+                         f"""
+                         CREATE INDEX IF NOT EXISTS idx_{self.table_name}_embedding_cosine
+                         ON {self.table_name}
+                         USING ivfflat (embedding vector_cosine_ops)
+                         WITH (lists = 100);
+                         """
+                     )
+
+                     conn.commit()
+                     logger.info(f"Updated embedding dimension to {dimension}")
+
+         self.dimension = dimension
+
+     def add_documents(
+         self,
+         texts: List[str],
+         embeddings: List[List[float]],
+         metadatas: Optional[List[Dict[str, Any]]] = None,
+     ) -> List[str]:
+         """
+         Add documents with their embeddings to the database.
+
+         Args:
+             texts: List of document texts
+             embeddings: List of embedding vectors
+             metadatas: Optional list of metadata dictionaries
+
+         Returns:
+             List of document IDs
+         """
+         if not texts or not embeddings:
+             return []
+
+         if len(texts) != len(embeddings):
+             raise ValueError("Number of texts must match number of embeddings")
+
+         if metadatas and len(metadatas) != len(texts):
+             raise ValueError("Number of metadatas must match number of texts")
+
+         # Ensure the embedding dimension is set
+         self._ensure_embedding_dimension(len(embeddings[0]))
+
+         # Default to empty metadata if not provided
+         if metadatas is None:
+             metadatas = [{}] * len(texts)
+
+         document_ids = []
+
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 for text, embedding, metadata in zip(texts, embeddings, metadatas):
+                     # Insert the document and get its ID
+                     cur.execute(
+                         f"""
+                         INSERT INTO {self.table_name} (content, embedding, metadata)
+                         VALUES (%s, %s, %s)
+                         RETURNING id;
+                         """,
+                         (text, embedding, psycopg2.extras.Json(metadata)),
+                     )
+
+                     doc_id = cur.fetchone()[0]
+                     document_ids.append(str(doc_id))
+
+                 conn.commit()
+                 logger.info(f"Added {len(document_ids)} documents to database")
+
+         return document_ids
+
+     def similarity_search(
+         self,
+         query_embedding: List[float],
+         k: int = 5,
+         filter_metadata: Optional[Dict[str, Any]] = None,
+     ) -> List[Dict]:
+         """
+         Perform similarity search using cosine distance.
+
+         Args:
+             query_embedding: Query embedding vector
+             k: Number of results to return
+             filter_metadata: Optional metadata filters
+
+         Returns:
+             List of documents with similarity scores
+         """
+         if not query_embedding:
+             return []
+
+         # Build the WHERE clause for metadata filtering. Parameters must follow
+         # their order of appearance in the SQL: SELECT embedding, WHERE filters,
+         # ORDER BY embedding, LIMIT.
+         where_clause = ""
+         filter_params: List[Any] = []
+
+         if filter_metadata:
+             conditions = []
+             for key, value in filter_metadata.items():
+                 if isinstance(value, str):
+                     conditions.append("metadata->>%s = %s")
+                     filter_params.extend([key, value])
+                 elif isinstance(value, (int, float)):
+                     conditions.append("(metadata->>%s)::numeric = %s")
+                     filter_params.extend([key, value])
+
+             if conditions:
+                 where_clause = "WHERE " + " AND ".join(conditions)
+
+         params = [query_embedding] + filter_params + [query_embedding, k]
+
+         query = f"""
+             SELECT id, content, metadata,
+                    1 - (embedding <=> %s) as similarity_score
+             FROM {self.table_name}
+             {where_clause}
+             ORDER BY embedding <=> %s
+             LIMIT %s;
+         """
+
+         with self._get_connection() as conn:
+             with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
+                 cur.execute(query, params)
+                 results = cur.fetchall()
+
+                 return [
+                     {
+                         "id": str(row["id"]),
+                         "content": row["content"],
+                         "metadata": row["metadata"] or {},
+                         "similarity_score": float(row["similarity_score"]),
+                     }
+                     for row in results
+                 ]
+
+     def get_collection_info(self) -> Dict[str, Any]:
+         """Get information about the vector collection."""
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 # Get the document count
+                 cur.execute(f"SELECT COUNT(*) FROM {self.table_name}")
+                 doc_count = cur.fetchone()[0]
+
+                 # Get the table size
+                 cur.execute(
+                     "SELECT pg_size_pretty(pg_total_relation_size(%s)) as size;",
+                     (self.table_name,),
+                 )
+                 table_size = cur.fetchone()[0]
+
+                 # Get dimension info
+                 cur.execute(
+                     """
+                     SELECT column_name, data_type
+                     FROM information_schema.columns
+                     WHERE table_name = %s AND column_name = 'embedding';
+                     """,
+                     (self.table_name,),
+                 )
+                 embedding_info = cur.fetchone()
+
+                 return {
+                     "document_count": doc_count,
+                     "table_size": table_size,
+                     "embedding_dimension": self.dimension,
+                     "table_name": self.table_name,
+                     "embedding_column_type": (
+                         embedding_info[1] if embedding_info else None
+                     ),
+                 }
+
+     def delete_documents(self, document_ids: List[str]) -> int:
+         """
+         Delete documents by their IDs.
+
+         Args:
+             document_ids: List of document IDs to delete
+
+         Returns:
+             Number of documents deleted
+         """
+         if not document_ids:
+             return 0
+
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 # Convert string IDs to integers
+                 int_ids = [int(doc_id) for doc_id in document_ids]
+
+                 cur.execute(
+                     f"""
+                     DELETE FROM {self.table_name}
+                     WHERE id = ANY(%s)
+                     """,
+                     (int_ids,),
+                 )
+
+                 deleted_count = cur.rowcount
+                 conn.commit()
+
+                 logger.info(f"Deleted {deleted_count} documents")
+                 return deleted_count
+
+     def delete_all_documents(self) -> int:
+         """
+         Delete all documents from the collection.
+
+         Returns:
+             Number of documents deleted
+         """
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 cur.execute(f"SELECT COUNT(*) FROM {self.table_name}")
+                 count_before = cur.fetchone()[0]
+
+                 cur.execute(f"DELETE FROM {self.table_name}")
+
+                 # Reset the sequence
+                 cur.execute(f"ALTER SEQUENCE {self.table_name}_id_seq RESTART WITH 1")
+
+                 conn.commit()
+                 logger.info(f"Deleted all {count_before} documents")
+                 return count_before
+
+     def update_document(
+         self,
+         document_id: str,
+         content: Optional[str] = None,
+         embedding: Optional[List[float]] = None,
+         metadata: Optional[Dict[str, Any]] = None,
+     ) -> bool:
+         """
+         Update a document's content, embedding, or metadata.
+
+         Args:
+             document_id: ID of the document to update
+             content: New content (optional)
+             embedding: New embedding (optional)
+             metadata: New metadata (optional)
+
+         Returns:
+             True if the document was updated, False if not found
+         """
+         if not any([content, embedding, metadata]):
+             return False
+
+         updates = []
+         params = []
+
+         if content is not None:
+             updates.append("content = %s")
+             params.append(content)
+
+         if embedding is not None:
+             updates.append("embedding = %s")
+             params.append(embedding)
+
+         if metadata is not None:
+             updates.append("metadata = %s")
+             params.append(psycopg2.extras.Json(metadata))
+
+         updates.append("updated_at = CURRENT_TIMESTAMP")
+         params.append(int(document_id))
+
+         query = f"""
+             UPDATE {self.table_name}
+             SET {', '.join(updates)}
+             WHERE id = %s
+         """
+
+         with self._get_connection() as conn:
+             with conn.cursor() as cur:
+                 cur.execute(query, params)
+                 updated = cur.rowcount > 0
+                 conn.commit()
+
+                 if updated:
+                     logger.info(f"Updated document {document_id}")
+                 else:
+                     logger.warning(f"Document {document_id} not found for update")
+
+                 return updated
+
+     def get_document(self, document_id: str) -> Optional[Dict[str, Any]]:
+         """
+         Get a single document by ID.
+
+         Args:
+             document_id: ID of the document to retrieve
+
+         Returns:
+             Document dictionary or None if not found
+         """
+         with self._get_connection() as conn:
+             with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
+                 cur.execute(
+                     f"""
+                     SELECT id, content, metadata, created_at, updated_at
+                     FROM {self.table_name}
+                     WHERE id = %s
+                     """,
+                     (int(document_id),),
+                 )
+
+                 row = cur.fetchone()
+                 if row:
+                     return {
+                         "id": str(row["id"]),
+                         "content": row["content"],
+                         "metadata": row["metadata"] or {},
+                         "created_at": (
+                             row["created_at"].isoformat() if row["created_at"] else None
+                         ),
+                         "updated_at": (
+                             row["updated_at"].isoformat() if row["updated_at"] else None
+                         ),
+                     }
+                 return None
+
+     def health_check(self) -> Dict[str, Any]:
+         """
+         Check the health of the vector database service.
+
+         Returns:
+             Health status dictionary
+         """
+         try:
+             with self._get_connection() as conn:
+                 with conn.cursor() as cur:
+                     # Test basic connectivity
+                     cur.execute("SELECT 1")
+
+                     # Check whether the pgvector extension is installed
+                     cur.execute(
+                         "SELECT EXISTS(SELECT 1 FROM pg_extension WHERE extname = 'vector')"
+                     )
+                     pgvector_installed = cur.fetchone()[0]
+
+             # Get basic stats (opens its own connection)
+             info = self.get_collection_info()
+
+             return {
+                 "status": "healthy",
+                 "pgvector_installed": pgvector_installed,
+                 "connection": "ok",
+                 "collection_info": info,
+             }
+
+         except Exception as e:
+             logger.error(f"Health check failed: {e}")
+             return {"status": "unhealthy", "error": str(e), "connection": "failed"}
src/vector_store/vector_db.py CHANGED
@@ -1,12 +1,42 @@
  import logging
  from pathlib import Path
- from typing import Any, Dict, List
+ from typing import Any, Dict, List, Optional, Protocol, Union
 
  import chromadb
 
+ from src.config import VECTOR_STORAGE_TYPE
  from src.utils.memory_utils import log_memory_checkpoint, memory_monitor
 
 
+ def create_vector_database(
+     persist_path: Optional[str] = None, collection_name: Optional[str] = None
+ ):
+     """
+     Factory function to create the appropriate vector database implementation.
+
+     Args:
+         persist_path: Path for persistence (used by ChromaDB)
+         collection_name: Name of the collection
+
+     Returns:
+         Vector database implementation
+     """
+     if VECTOR_STORAGE_TYPE == "postgres":
+         from src.vector_db.postgres_adapter import PostgresVectorAdapter
+
+         return PostgresVectorAdapter(
+             table_name=collection_name or "document_embeddings"
+         )
+     else:
+         # Default to ChromaDB
+         from src.config import COLLECTION_NAME, VECTOR_DB_PERSIST_PATH
+
+         return VectorDatabase(
+             persist_path=persist_path or VECTOR_DB_PERSIST_PATH,
+             collection_name=collection_name or COLLECTION_NAME,
+         )
+
+
  class VectorDatabase:
      """ChromaDB integration for vector storage and similarity search"""
 
tests/test_embedding/test_embedding_service.py CHANGED
@@ -1,13 +1,14 @@
+ from typing import List
+
  from src.embedding.embedding_service import EmbeddingService
 
 
  def test_embedding_service_initialization():
      """Test EmbeddingService initialization"""
-     # Test will fail initially - we'll implement EmbeddingService to make it pass
      service = EmbeddingService()
 
      assert service is not None
-     assert service.model_name == "all-MiniLM-L12-v2"
+     assert service.model_name == "sentence-transformers/all-MiniLM-L6-v2"
      assert service.device == "cpu"
 
 
@@ -171,12 +172,12 @@ def test_similarity_makes_sense():
      embed3 = service.embed_text(text3)
 
      # Calculate simple cosine similarity (for validation)
-     def cosine_similarity(a, b):
+     def cosine_similarity(a: List[float], b: List[float]) -> float:
          import numpy as np
 
          a_np = np.array(a)
          b_np = np.array(b)
-         return np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np))
+         return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))
 
      sim_1_2 = cosine_similarity(embed1, embed2)  # Similar texts
      sim_1_3 = cosine_similarity(embed1, embed3)  # Different texts
tests/test_vector_store/test_postgres_vector.py ADDED
@@ -0,0 +1,366 @@
1
+ """
2
+ Tests for PostgresVectorService and PostgresVectorAdapter.
3
+ """
4
+
5
+ import os
6
+ from typing import Any, Dict, List
7
+ from unittest.mock import MagicMock, Mock, patch
8
+
9
+ import pytest
10
+
11
+ from src.vector_db.postgres_adapter import PostgresVectorAdapter
12
+ from src.vector_db.postgres_vector_service import PostgresVectorService
13
+
14
+
15
+ class TestPostgresVectorService:
16
+ """Test PostgresVectorService functionality."""
17
+
18
+ def setup_method(self):
19
+ """Setup test fixtures."""
20
+ self.test_connection_string = "postgresql://test:test@localhost:5432/test_db"
21
+ self.test_table_name = "test_embeddings"
22
+
23
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
24
+ def test_initialization(self, mock_connect):
25
+ """Test service initialization."""
26
+ mock_conn = MagicMock()  # MagicMock is required: "with conn.cursor()" needs __enter__, which plain Mock rejects
27
+ mock_cursor = Mock()
28
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
29
+ mock_connect.return_value = mock_conn
30
+
31
+ service = PostgresVectorService(
32
+ connection_string=self.test_connection_string,
33
+ table_name=self.test_table_name,
34
+ )
35
+
36
+ assert service.connection_string == self.test_connection_string
37
+ assert service.table_name == self.test_table_name
38
+
39
+ # Verify initialization queries were called
40
+ mock_cursor.execute.assert_any_call("CREATE EXTENSION IF NOT EXISTS vector;")
41
+
42
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
43
+ def test_add_documents(self, mock_connect):
44
+ """Test adding documents."""
45
+ mock_conn = MagicMock()
46
+ mock_cursor = Mock()
47
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
48
+ mock_cursor.fetchone.return_value = [1] # Mock returned ID
49
+ mock_connect.return_value = mock_conn
50
+
51
+ service = PostgresVectorService(
52
+ connection_string=self.test_connection_string,
53
+ table_name=self.test_table_name,
54
+ )
55
+
56
+ texts = ["test document 1", "test document 2"]
57
+ embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
58
+ metadatas = [{"source": "test1"}, {"source": "test2"}]
59
+
60
+ doc_ids = service.add_documents(texts, embeddings, metadatas)
61
+
62
+ assert len(doc_ids) == 2
63
+ assert all(isinstance(doc_id, str) for doc_id in doc_ids)
64
+
65
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
66
+ def test_similarity_search(self, mock_connect):
67
+ """Test similarity search."""
68
+ mock_conn = MagicMock()
69
+ mock_cursor = Mock()
70
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
71
+
72
+ # Mock search results
73
+ mock_cursor.fetchall.return_value = [
74
+ {
75
+ "id": 1,
76
+ "content": "test document",
77
+ "metadata": {"source": "test"},
78
+ "similarity_score": 0.85,
79
+ }
80
+ ]
81
+
82
+ mock_connect.return_value = mock_conn
83
+
84
+ service = PostgresVectorService(
85
+ connection_string=self.test_connection_string,
86
+ table_name=self.test_table_name,
87
+ )
88
+
89
+ query_embedding = [0.1, 0.2, 0.3]
90
+ results = service.similarity_search(query_embedding, k=5)
91
+
92
+ assert len(results) == 1
93
+ assert results[0]["id"] == "1"
94
+ assert results[0]["content"] == "test document"
95
+ assert "similarity_score" in results[0]
96
+
97
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
98
+ def test_get_collection_info(self, mock_connect):
99
+ """Test getting collection information."""
100
+ mock_conn = MagicMock()
101
+ mock_cursor = Mock()
102
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
103
+
104
+ # Mock collection info queries
105
+ mock_cursor.fetchone.side_effect = [
106
+ [100], # document count
107
+ ["1.2 MB"], # table size
108
+ ["embedding", "vector(384)"], # column info
109
+ ]
110
+
111
+ mock_connect.return_value = mock_conn
112
+
113
+ service = PostgresVectorService(
114
+ connection_string=self.test_connection_string,
115
+ table_name=self.test_table_name,
116
+ )
117
+ service.dimension = 384 # Set dimension
118
+
119
+ info = service.get_collection_info()
120
+
121
+ assert info["document_count"] == 100
122
+ assert info["table_size"] == "1.2 MB"
123
+ assert info["embedding_dimension"] == 384
124
+
125
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
126
+ def test_delete_documents(self, mock_connect):
127
+ """Test deleting specific documents."""
128
+ mock_conn = MagicMock()
129
+ mock_cursor = Mock()
130
+ mock_cursor.rowcount = 2
131
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
132
+ mock_connect.return_value = mock_conn
133
+
134
+ service = PostgresVectorService(
135
+ connection_string=self.test_connection_string,
136
+ table_name=self.test_table_name,
137
+ )
138
+
139
+ deleted_count = service.delete_documents(["1", "2"])
140
+
141
+ assert deleted_count == 2
142
+
143
+ @patch("src.vector_db.postgres_vector_service.psycopg2.connect")
144
+ def test_health_check(self, mock_connect):
145
+ """Test health check functionality."""
146
+ mock_conn = MagicMock()
147
+ mock_cursor = Mock()
148
+ mock_conn.cursor.return_value.__enter__.return_value = mock_cursor
149
+
150
+ # Mock health check queries
151
+ mock_cursor.fetchone.side_effect = [
152
+ [1], # SELECT 1
153
+ [True], # pgvector extension check
154
+ [10], # document count
155
+ ["500 KB"], # table size
156
+ ["embedding", "vector(384)"], # column info
157
+ ]
158
+
159
+ mock_connect.return_value = mock_conn
160
+
161
+ service = PostgresVectorService(
162
+ connection_string=self.test_connection_string,
163
+ table_name=self.test_table_name,
164
+ )
165
+ service.dimension = 384
166
+
167
+ health = service.health_check()
168
+
169
+ assert health["status"] == "healthy"
170
+ assert health["pgvector_installed"] is True
171
+ assert health["connection"] == "ok"
172
+
173
+
174
+ class TestPostgresVectorAdapter:
175
+ """Test PostgresVectorAdapter compatibility with ChromaDB interface."""
176
+
177
+ def setup_method(self):
178
+ """Setup test fixtures."""
179
+ self.test_table_name = "test_embeddings"
180
+
181
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
182
+ def test_adapter_initialization(self, mock_service_class):
183
+ """Test adapter initialization."""
184
+ mock_service = Mock()
185
+ mock_service_class.return_value = mock_service
186
+
187
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
188
+
189
+ assert adapter.collection_name == self.test_table_name
190
+ mock_service_class.assert_called_once_with(table_name=self.test_table_name)
191
+
192
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
193
+ def test_add_embeddings_chromadb_compatibility(self, mock_service_class):
194
+ """Test add_embeddings method compatibility with ChromaDB interface."""
195
+ mock_service = Mock()
196
+ mock_service.add_documents.return_value = ["1", "2"]
197
+ mock_service_class.return_value = mock_service
198
+
199
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
200
+
201
+ embeddings = [[0.1, 0.2], [0.3, 0.4]]
202
+ chunk_ids = ["chunk1", "chunk2"]
203
+ documents = ["doc1", "doc2"]
204
+ metadatas = [{"source": "test1"}, {"source": "test2"}]
205
+
206
+ result = adapter.add_embeddings(embeddings, chunk_ids, documents, metadatas)
207
+
208
+ assert result is True
209
+ mock_service.add_documents.assert_called_once_with(
210
+ documents, embeddings, metadatas
211
+ )
212
+
213
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
214
+ def test_search_chromadb_compatibility(self, mock_service_class):
215
+ """Test search method compatibility with ChromaDB interface."""
216
+ mock_service = Mock()
217
+ mock_service.similarity_search.return_value = [
218
+ {
219
+ "id": "1",
220
+ "content": "test document",
221
+ "metadata": {"source": "test"},
222
+ "similarity_score": 0.85,
223
+ }
224
+ ]
225
+ mock_service_class.return_value = mock_service
226
+
227
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
228
+
229
+ query_embedding = [0.1, 0.2, 0.3]
230
+ results = adapter.search(query_embedding, top_k=5)
231
+
232
+ assert len(results) == 1
233
+ assert results[0]["id"] == "1"
234
+ assert results[0]["document"] == "test document"
235
+ assert results[0]["metadata"] == {"source": "test"}
236
+ assert "distance" in results[0]
237
+ assert results[0]["distance"] == pytest.approx(0.15) # 1.0 - 0.85
238
+
239
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
240
+ def test_get_count_chromadb_compatibility(self, mock_service_class):
241
+ """Test get_count method compatibility with ChromaDB interface."""
242
+ mock_service = Mock()
243
+ mock_service.get_collection_info.return_value = {"document_count": 42}
244
+ mock_service_class.return_value = mock_service
245
+
246
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
247
+
248
+ count = adapter.get_count()
249
+
250
+ assert count == 42
251
+
252
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
253
+ def test_batch_operations(self, mock_service_class):
254
+ """Test batch operations compatibility."""
255
+ mock_service = Mock()
256
+ mock_service.add_documents.return_value = ["1", "2"]
257
+ mock_service_class.return_value = mock_service
258
+
259
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
260
+
261
+ batch_embeddings = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
262
+ batch_chunk_ids = [["chunk1", "chunk2"], ["chunk3"]]
263
+ batch_documents = [["doc1", "doc2"], ["doc3"]]
264
+ batch_metadatas = [
265
+ [{"source": "test1"}, {"source": "test2"}],
266
+ [{"source": "test3"}],
267
+ ]
268
+
269
+ total_added = adapter.add_embeddings_batch(
270
+ batch_embeddings, batch_chunk_ids, batch_documents, batch_metadatas
271
+ )
272
+
273
+ assert total_added == 3 # 2 + 1
274
+ assert mock_service.add_documents.call_count == 2
275
+
276
+
277
+ class TestVectorDatabaseFactory:
278
+ """Test the vector database factory function."""
279
+
280
+ # Patch the module-level constant rather than os.environ: the config value is
+ # read at import time, so an environment patch alone may not take effect here.
+ @patch("src.vector_store.vector_db.VECTOR_STORAGE_TYPE", "postgres")
281
+ @patch("src.vector_store.vector_db.PostgresVectorAdapter")
282
+ def test_factory_creates_postgres_adapter(self, mock_adapter_class):
283
+ """Test factory creates PostgreSQL adapter when configured."""
284
+ from src.vector_store.vector_db import create_vector_database
285
+
286
+ mock_adapter = Mock()
287
+ mock_adapter_class.return_value = mock_adapter
288
+
289
+ db = create_vector_database(collection_name="test_collection")
290
+
291
+ assert db == mock_adapter
292
+ mock_adapter_class.assert_called_once_with(table_name="test_collection")
293
+
294
+ @patch("src.vector_store.vector_db.VECTOR_STORAGE_TYPE", "chroma")
295
+ @patch("src.vector_store.vector_db.VectorDatabase")
296
+ def test_factory_creates_chroma_database(self, mock_vector_db_class):
297
+ """Test factory creates ChromaDB when configured."""
298
+ from src.vector_store.vector_db import create_vector_database
299
+
300
+ mock_db = Mock()
301
+ mock_vector_db_class.return_value = mock_db
302
+
303
+ db = create_vector_database(
304
+ persist_path="/test/path", collection_name="test_collection"
305
+ )
306
+
307
+ assert db == mock_db
308
+ mock_vector_db_class.assert_called_once_with(
309
+ persist_path="/test/path", collection_name="test_collection"
310
+ )
311
+
312
+
313
+ # Integration tests (require actual database)
314
+ @pytest.mark.integration
315
+ class TestPostgresIntegration:
316
+ """Integration tests that require a real PostgreSQL database."""
317
+
318
+ @pytest.fixture
319
+ def postgres_service(self):
320
+ """Create a PostgreSQL service for testing."""
321
+ # Only run if DATABASE_URL is set
322
+ database_url = os.getenv("TEST_DATABASE_URL")
323
+ if not database_url:
324
+ pytest.skip("TEST_DATABASE_URL not set")
325
+
326
+ service = PostgresVectorService(
327
+ connection_string=database_url, table_name="test_embeddings"
328
+ )
329
+
330
+ # Clean up before test
331
+ service.delete_all_documents()
332
+
333
+ yield service
334
+
335
+ # Clean up after test
336
+ try:
337
+ service.delete_all_documents()
338
+ except Exception:
339
+ pass # Ignore cleanup errors
340
+
341
+ def test_full_workflow(self, postgres_service):
342
+ """Test complete workflow with real database."""
343
+ # Add documents
344
+ texts = ["This is a test document.", "Another test document."]
345
+ embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
346
+ metadatas = [{"source": "test1"}, {"source": "test2"}]
347
+
348
+ doc_ids = postgres_service.add_documents(texts, embeddings, metadatas)
349
+ assert len(doc_ids) == 2
350
+
351
+ # Search
352
+ query_embedding = [0.1, 0.2, 0.3, 0.4]
353
+ results = postgres_service.similarity_search(query_embedding, k=2)
354
+
355
+ assert len(results) <= 2
356
+ if results:
357
+ assert "content" in results[0]
358
+ assert "similarity_score" in results[0]
359
+
360
+ # Get info
361
+ info = postgres_service.get_collection_info()
362
+ assert info["document_count"] == 2
363
+
364
+ # Health check
365
+ health = postgres_service.health_check()
366
+ assert health["status"] == "healthy"
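These integration tests are gated twice: by the `integration` marker and by the `TEST_DATABASE_URL` check in the fixture. A sketch of driving just the marked tests programmatically (the URL is a placeholder; registering the marker in pytest configuration avoids unknown-marker warnings):

```python
# Sketch: run only the PostgreSQL integration tests against a disposable database.
import os

import pytest

os.environ["TEST_DATABASE_URL"] = "postgresql://test:test@localhost:5432/test_db"
raise SystemExit(
    pytest.main(["-m", "integration", "tests/test_vector_store/test_postgres_vector.py"])
)
```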
tests/test_vector_store/test_postgres_vector_simple.py ADDED
@@ -0,0 +1,72 @@
1
+ """
2
+ Tests for PostgresVectorService and PostgresVectorAdapter (simplified).
3
+ """
4
+
5
+ import os
6
+ from unittest.mock import Mock, patch
7
+
8
+ import pytest
9
+
10
+ from src.vector_db.postgres_adapter import PostgresVectorAdapter
11
+
12
+
13
+ class TestPostgresVectorAdapter:
14
+ """Test PostgresVectorAdapter compatibility."""
15
+
16
+ def setup_method(self):
17
+ """Setup test fixtures."""
18
+ self.test_table_name = "test_embeddings"
19
+
20
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
21
+ def test_adapter_initialization(self, mock_service_class):
22
+ """Test adapter initialization."""
23
+ mock_service = Mock()
24
+ mock_service_class.return_value = mock_service
25
+
26
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
27
+
28
+ assert adapter.collection_name == self.test_table_name
29
+
30
+ @patch("src.vector_db.postgres_adapter.PostgresVectorService")
31
+ def test_get_count_chromadb_compatibility(self, mock_service_class):
32
+ """Test get_count method compatibility with ChromaDB interface."""
33
+ mock_service = Mock()
34
+ mock_service.get_collection_info.return_value = {"document_count": 42}
35
+ mock_service_class.return_value = mock_service
36
+
37
+ adapter = PostgresVectorAdapter(table_name=self.test_table_name)
38
+
39
+ count = adapter.get_count()
40
+
41
+ assert count == 42
42
+
43
+
44
+ class TestVectorDatabaseFactory:
45
+ """Test the vector database factory function."""
46
+
47
+ @patch("src.vector_store.vector_db.VECTOR_STORAGE_TYPE", "chroma")
48
+ @patch("src.vector_store.vector_db.VectorDatabase")
49
+ def test_factory_creates_chroma_database(self, mock_vector_db_class):
50
+ """Test factory creates ChromaDB when configured."""
51
+ from src.vector_store.vector_db import create_vector_database
52
+
53
+ mock_db = Mock()
54
+ mock_vector_db_class.return_value = mock_db
55
+
56
+ db = create_vector_database(
57
+ persist_path="/test/path", collection_name="test_collection"
58
+ )
59
+
60
+ assert db == mock_db
61
+
62
+
63
+ # Integration tests (require actual database)
64
+ @pytest.mark.integration
65
+ class TestPostgresIntegration:
66
+ """Integration tests that require a real PostgreSQL database."""
67
+
68
+ def test_skip_integration(self):
69
+ """Skip integration tests without database."""
70
+ database_url = os.getenv("TEST_DATABASE_URL")
71
+ if not database_url:
72
+ pytest.skip("TEST_DATABASE_URL not set")