File size: 6,652 Bytes
dca679b
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
45cf08e
dca679b
 
 
45cf08e
dca679b
 
 
 
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
 
 
 
 
45cf08e
 
 
 
 
 
dca679b
 
 
 
45cf08e
dca679b
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
 
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
45cf08e
dca679b
 
 
 
 
 
 
 
45cf08e
dca679b
 
 
 
 
 
 
 
 
45cf08e
dca679b
45cf08e
dca679b
 
 
45cf08e
dca679b
 
 
 
45cf08e
 
 
 
 
 
 
dca679b
 
45cf08e
dca679b
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
# PostgreSQL Migration Guide

## Overview

This branch implements PostgreSQL with pgvector as an alternative to ChromaDB for vector storage. This reduces memory usage from 400MB+ to ~50-100MB by storing vectors on disk instead of in RAM.

## What's Been Implemented

### 1. PostgresVectorService (`src/vector_db/postgres_vector_service.py`)

- Full PostgreSQL integration with pgvector extension
- Automatic table creation and indexing
- Similarity search using cosine distance
- Document CRUD operations
- Health monitoring and collection info

### 2. PostgresVectorAdapter (`src/vector_db/postgres_adapter.py`)

- Compatibility layer for existing ChromaDB interface
- Ensures seamless migration without code changes
- Converts between PostgreSQL and ChromaDB result formats

### 3. Updated Configuration (`src/config.py`)

- Added `VECTOR_STORAGE_TYPE` environment variable
- PostgreSQL connection settings
- Memory optimization parameters

### 4. Factory Pattern (`src/vector_store/vector_db.py`)

- `create_vector_database()` function selects backend automatically
- Supports both ChromaDB and PostgreSQL based on configuration

### 5. Migration Script (`scripts/migrate_to_postgres.py`)

- Data optimization (text summarization, metadata cleaning)
- Batch processing with memory management
- Handles 4GB → 1GB data reduction for free tier

### 6. Tests (`tests/test_vector_store/test_postgres_vector.py`)

- Unit tests with mocked dependencies
- Integration tests for real database
- Compatibility tests for ChromaDB interface

## Setup Instructions

### Step 1: Create Render PostgreSQL Database

1. Go to Render Dashboard
2. Create → PostgreSQL
3. Choose "Free" plan (1GB storage, 30 days)
4. Save the connection details

### Step 2: Enable pgvector Extension

You have several options to enable pgvector:

**Option A: Use the initialization script (Recommended)**

```bash
# Set your database URL
export DATABASE_URL="postgresql://user:password@host:port/database"

# Run the initialization script
python scripts/init_pgvector.py
```

**Option B: Manual SQL**
Connect to your database and run:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
```

**Option C: From Render Dashboard**

1. Go to your PostgreSQL service → Info tab
2. Use the "PSQL Command" to connect
3. Run: `CREATE EXTENSION IF NOT EXISTS vector;`

The initialization script (`scripts/init_pgvector.py`) will:

- Test database connection
- Check PostgreSQL version compatibility (13+)
- Install pgvector extension safely
- Verify vector operations work correctly
- Provide detailed logging and error messages

### Step 3: Update Environment Variables

Add to your Render environment variables:

```bash
DATABASE_URL=postgresql://username:password@host:port/database
VECTOR_STORAGE_TYPE=postgres
MEMORY_LIMIT_MB=400
```

### Step 4: Install Dependencies

```bash
pip install psycopg2-binary==2.9.7
```

### Step 5: Run Migration (Optional)

If you have existing ChromaDB data:

```bash
python scripts/migrate_to_postgres.py --database-url="your-connection-string"
```

## Usage

### Switch to PostgreSQL

Set environment variable:

```bash
export VECTOR_STORAGE_TYPE=postgres
```

### Use in Code (No Changes Required!)

```python
from src.vector_store.vector_db import create_vector_database

# Automatically uses PostgreSQL if VECTOR_STORAGE_TYPE=postgres
vector_db = create_vector_database()
vector_db.add_embeddings(embeddings, ids, documents, metadatas)
results = vector_db.search(query_embedding, top_k=5)
```

## Expected Memory Reduction

| Component        | Before (ChromaDB) | After (PostgreSQL)   | Savings       |
| ---------------- | ----------------- | -------------------- | ------------- |
| Vector Storage   | 200-300MB         | 0MB (disk)           | 200-300MB     |
| Embedding Model  | 100MB             | 50MB (smaller model) | 50MB          |
| Application Code | 50-100MB          | 50-100MB             | 0MB           |
| **Total**        | **350-500MB**     | **50-150MB**         | **300-350MB** |

## Migration Optimizations

### Data Size Reduction

- **Text Summarization**: Documents truncated to 1000 characters
- **Metadata Cleaning**: Only essential fields kept
- **Dimension Reduction**: Can use smaller embedding models
- **Quality Filtering**: Skip very short or low-quality documents

### Memory Management

- **Batch Processing**: Process documents in small batches
- **Garbage Collection**: Aggressive cleanup between operations
- **Streaming**: Process data without loading everything into memory

## Testing

### Unit Tests

```bash
pytest tests/test_vector_store/test_postgres_vector.py -v
```

### Integration Tests (Requires Database)

```bash
export TEST_DATABASE_URL="postgresql://test:test@localhost:5432/test_db"
pytest tests/test_vector_store/test_postgres_vector.py -m integration -v
```

### Migration Test

```bash
python scripts/migrate_to_postgres.py --test-only
```

## Deployment

### Local Development

Keep using ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

### Production (Render)

Switch to PostgreSQL:

```bash
export VECTOR_STORAGE_TYPE=postgres
export DATABASE_URL="your-render-postgres-url"
```

## Troubleshooting

### Common Issues

1. **"pgvector extension not found"**

   - Run `CREATE EXTENSION vector;` in your database

2. **Connection errors**

   - Verify DATABASE_URL format: `postgresql://user:pass@host:port/db`
   - Check firewall/network connectivity

3. **Memory still high**
   - Verify `VECTOR_STORAGE_TYPE=postgres`
   - Check that old ChromaDB files aren't being loaded

### Monitoring

```python
from src.vector_db.postgres_vector_service import PostgresVectorService

service = PostgresVectorService()
health = service.health_check()
print(health)  # Shows connection status, document count, etc.
```

## Rollback Plan

If issues occur, simply change back to ChromaDB:

```bash
export VECTOR_STORAGE_TYPE=chroma
```

The factory pattern ensures seamless switching between backends.

## Performance Comparison

| Operation   | ChromaDB   | PostgreSQL | Notes                  |
| ----------- | ---------- | ---------- | ---------------------- |
| Insert      | Fast       | Medium     | Network overhead       |
| Search      | Very Fast  | Fast       | pgvector is optimized  |
| Memory      | High       | Low        | Vectors stored on disk |
| Persistence | File-based | Database   | More reliable          |
| Scaling     | Limited    | Excellent  | Can upgrade storage    |

## Next Steps

1. Test locally with PostgreSQL
2. Create Render PostgreSQL database
3. Run migration script
4. Deploy with `VECTOR_STORAGE_TYPE=postgres`
5. Monitor memory usage in production