Tobias Pasquale committed on
Commit 5abed81 · 2 Parents: ca68eb2 57abf2f

Merge pull request #27 from sethmcknight/feat/enhanced-ingestion-pipeline

.gitignore CHANGED
@@ -36,3 +36,6 @@ planning/
 *.log
 *.tmp
 .env.local
+
+# Vector Database (ChromaDB data)
+data/chroma_db/
CHANGELOG.md CHANGED
@@ -19,6 +19,56 @@ Each entry includes:
 
 ---
 
+### 2025-10-17 - Enhanced Ingestion Pipeline with Embeddings Integration
+
+**Entry #019** | **Action Type**: CREATE/UPDATE | **Component**: Enhanced Ingestion Pipeline | **Issue**: #21
+
+- **Files Changed**:
+  - `src/ingestion/ingestion_pipeline.py` (ENHANCED) - Added embedding integration and enhanced reporting
+  - `app.py` (UPDATED) - Enhanced /ingest endpoint with configurable embedding storage
+  - `tests/test_ingestion/test_enhanced_ingestion_pipeline.py` (NEW) - Comprehensive test suite for enhanced functionality
+  - `tests/test_enhanced_app.py` (NEW) - Flask endpoint tests for enhanced ingestion
+- **Implementation Details**:
+  - **Core Features**: Embeddings integration with configurable on/off, batch processing with 32-item batches, enhanced API response with statistics
+  - **Backward Compatibility**: Maintained original `process_directory()` method for existing tests, added new `process_directory_with_embeddings()` method
+  - **API Enhancement**: /ingest endpoint accepts `{"store_embeddings": true/false}` parameter; enhanced response includes files_processed, embeddings_stored, failed_files
+  - **Error Handling**: Comprehensive error handling with graceful degradation, detailed failure reporting per file and batch
+  - **Batch Processing**: Memory-efficient 32-chunk batches for embedding generation, progress reporting during processing
+  - **Integration**: Seamless integration with existing EmbeddingService and VectorDatabase components
+- **Test Coverage**:
+  - ✅ 14/14 enhanced ingestion tests passing (100% success rate)
+    - Unit tests with mocked embedding services (4 tests)
+    - Integration tests with real components (4 tests)
+    - Backward compatibility validation (2 tests)
+    - Flask endpoint testing (4 tests)
+  - ✅ All existing tests maintained backward compatibility (8/8 passing)
+- **Quality Assurance**:
+  - ✅ Comprehensive error handling with graceful degradation
+  - ✅ Memory-efficient batch processing implementation
+  - ✅ Backward compatibility maintained for existing API
+  - ✅ Enhanced reporting and statistics generation
+- **Performance**:
+  - Batch processing: 32 chunks per batch for memory efficiency
+  - Progress reporting: real-time batch processing updates
+  - Error resilience: continues processing despite individual file/batch failures
+- **Flask API Enhancement**:
+  - Enhanced /ingest endpoint with JSON parameter support
+  - Configurable embedding storage: `{"store_embeddings": true/false}`
+  - Enhanced response format with comprehensive statistics
+  - Backward compatible with existing clients
+- **Dependencies**:
+  - Builds on existing EmbeddingService and VectorDatabase (Phase 2A)
+  - Integrates with SearchService for complete RAG pipeline
+  - Maintains compatibility with existing ingestion components
+- **CI/CD**: ✅ All 71 tests pass, including new enhanced functionality
+- **Notes**:
+  - Addresses GitHub Issue #21 requirements completely
+  - Maintains full backward compatibility while adding enhanced features
+  - Ready for integration with SearchService and upcoming /search endpoint
+  - Sets foundation for complete RAG pipeline implementation
+
+---
+
 ## Changelog Entries
 
 ### 2025-12-28 - Phase 2B SearchService Implementation
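
As a quick illustration of the changelog entry above, the sketch below shows how a client might call the enhanced /ingest endpoint with embedding storage toggled off. The endpoint path and the store_embeddings parameter come from this change; the localhost URL is an assumption about a local dev server.

    # Minimal client sketch; http://localhost:5000 is an assumed local dev address
    import requests

    resp = requests.post(
        "http://localhost:5000/ingest",
        json={"store_embeddings": False},  # parameter introduced by this PR; defaults to True
    )
    # Response is expected to carry status, chunks_processed, files_processed, embeddings_stored
    print(resp.json())
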
app.py CHANGED
@@ -1,4 +1,4 @@
-from flask import Flask, jsonify, render_template
+from flask import Flask, jsonify, render_template, request
 
 app = Flask(__name__)
 
@@ -21,7 +21,7 @@ def health():
 
 @app.route("/ingest", methods=["POST"])
 def ingest():
-    """Endpoint to trigger document ingestion"""
+    """Endpoint to trigger document ingestion with embeddings"""
     try:
         from src.config import (
             CORPUS_DIRECTORY,
@@ -31,23 +31,138 @@ def ingest():
         )
         from src.ingestion.ingestion_pipeline import IngestionPipeline
 
+        # Get optional parameters from request
+        data = request.get_json() if request.is_json else {}
+        store_embeddings = data.get("store_embeddings", True)
+
         pipeline = IngestionPipeline(
-            chunk_size=DEFAULT_CHUNK_SIZE, overlap=DEFAULT_OVERLAP, seed=RANDOM_SEED
+            chunk_size=DEFAULT_CHUNK_SIZE,
+            overlap=DEFAULT_OVERLAP,
+            seed=RANDOM_SEED,
+            store_embeddings=store_embeddings,
         )
 
-        chunks = pipeline.process_directory(CORPUS_DIRECTORY)
+        result = pipeline.process_directory_with_embeddings(CORPUS_DIRECTORY)
 
-        return jsonify(
-            {
-                "status": "success",
-                "chunks_processed": len(chunks),
-                "message": f"Successfully processed {len(chunks)} chunks",
-            }
-        )
+        # Create response with enhanced information
+        response = {
+            "status": result["status"],
+            "chunks_processed": result["chunks_processed"],
+            "files_processed": result["files_processed"],
+            "embeddings_stored": result["embeddings_stored"],
+            "store_embeddings": result["store_embeddings"],
+            "message": (
+                f"Successfully processed {result['chunks_processed']} chunks "
+                f"from {result['files_processed']} files"
+            ),
+        }
+
+        # Include failed files info if any
+        if result["failed_files"]:
+            response["failed_files"] = result["failed_files"]
+            failed_count = len(result["failed_files"])
+            response["warnings"] = f"{failed_count} files failed to process"
+
+        return jsonify(response)
 
     except Exception as e:
         return jsonify({"status": "error", "message": str(e)}), 500
 
 
+@app.route("/search", methods=["POST"])
+def search():
+    """
+    Endpoint to perform semantic search on ingested documents.
+
+    Accepts JSON requests with query text and optional parameters.
+    Returns semantically similar document chunks.
+    """
+    try:
+        # Validate request contains JSON data
+        if not request.is_json:
+            return (
+                jsonify(
+                    {
+                        "status": "error",
+                        "message": "Content-Type must be application/json",
+                    }
+                ),
+                400,
+            )
+
+        data = request.get_json()
+
+        # Validate required query parameter
+        query = data.get("query")
+        if query is None:
+            return (
+                jsonify({"status": "error", "message": "Query parameter is required"}),
+                400,
+            )
+
+        if not isinstance(query, str) or not query.strip():
+            return (
+                jsonify(
+                    {"status": "error", "message": "Query must be a non-empty string"}
+                ),
+                400,
+            )
+
+        # Extract optional parameters with defaults
+        top_k = data.get("top_k", 5)
+        threshold = data.get("threshold", 0.3)
+
+        # Validate parameters
+        if not isinstance(top_k, int) or top_k <= 0:
+            return (
+                jsonify(
+                    {"status": "error", "message": "top_k must be a positive integer"}
+                ),
+                400,
+            )
+
+        if not isinstance(threshold, (int, float)) or not (0.0 <= threshold <= 1.0):
+            return (
+                jsonify(
+                    {
+                        "status": "error",
+                        "message": "threshold must be a number between 0 and 1",
+                    }
+                ),
+                400,
+            )
+
+        # Initialize search components
+        from src.config import COLLECTION_NAME, VECTOR_DB_PERSIST_PATH
+        from src.embedding.embedding_service import EmbeddingService
+        from src.search.search_service import SearchService
+        from src.vector_store.vector_db import VectorDatabase
+
+        vector_db = VectorDatabase(VECTOR_DB_PERSIST_PATH, COLLECTION_NAME)
+        embedding_service = EmbeddingService()
+        search_service = SearchService(vector_db, embedding_service)
+
+        # Perform search
+        results = search_service.search(
+            query=query.strip(), top_k=top_k, threshold=threshold
+        )
+
+        # Format response
+        response = {
+            "status": "success",
+            "query": query.strip(),
+            "results_count": len(results),
+            "results": results,
+        }
+
+        return jsonify(response)
+
+    except ValueError as e:
+        return jsonify({"status": "error", "message": str(e)}), 400
+
+    except Exception as e:
+        return jsonify({"status": "error", "message": f"Search failed: {str(e)}"}), 500
+
+
 if __name__ == "__main__":
     app.run(debug=True)
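
For illustration, a minimal client-side sketch of calling the new /search endpoint added above. The request and response fields follow the handler and the endpoint tests in this PR; the host and port are assumptions about a local dev server.

    # Minimal client sketch; http://localhost:5000 is an assumed local dev address
    import requests

    resp = requests.post(
        "http://localhost:5000/search",
        json={"query": "remote work policy", "top_k": 3, "threshold": 0.3},
    )
    data = resp.json()
    for hit in data.get("results", []):
        # per the endpoint tests, each result carries chunk_id, content, similarity_score, metadata
        print(hit["chunk_id"], hit["similarity_score"])
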
src/ingestion/ingestion_pipeline.py CHANGED
@@ -1,14 +1,24 @@
 from pathlib import Path
-from typing import Any, Dict, List
+from typing import Any, Dict, List, Optional
 
+from ..embedding.embedding_service import EmbeddingService
+from ..vector_store.vector_db import VectorDatabase
 from .document_chunker import DocumentChunker
 from .document_parser import DocumentParser
 
 
 class IngestionPipeline:
-    """Complete ingestion pipeline for processing document corpus"""
-
-    def __init__(self, chunk_size: int = 1000, overlap: int = 200, seed: int = 42):
+    """Complete ingestion pipeline for processing document corpus with embeddings"""
+
+    def __init__(
+        self,
+        chunk_size: int = 1000,
+        overlap: int = 200,
+        seed: int = 42,
+        store_embeddings: bool = True,
+        vector_db: Optional[VectorDatabase] = None,
+        embedding_service: Optional[EmbeddingService] = None,
+    ):
         """
         Initialize the ingestion pipeline
 
@@ -16,16 +26,35 @@ class IngestionPipeline:
             chunk_size: Size of text chunks
             overlap: Overlap between chunks
             seed: Random seed for reproducibility
+            store_embeddings: Whether to generate and store embeddings
+            vector_db: Vector database instance for storing embeddings
+            embedding_service: Embedding service for generating embeddings
         """
         self.parser = DocumentParser()
         self.chunker = DocumentChunker(
             chunk_size=chunk_size, overlap=overlap, seed=seed
         )
         self.seed = seed
+        self.store_embeddings = store_embeddings
+
+        # Initialize embedding components if storing embeddings
+        if store_embeddings:
+            self.embedding_service = embedding_service or EmbeddingService()
+            if vector_db is None:
+                from ..config import COLLECTION_NAME, VECTOR_DB_PERSIST_PATH
+
+                self.vector_db = VectorDatabase(
+                    persist_path=VECTOR_DB_PERSIST_PATH, collection_name=COLLECTION_NAME
+                )
+            else:
+                self.vector_db = vector_db
+        else:
+            self.embedding_service = None
+            self.vector_db = None
 
     def process_directory(self, directory_path: str) -> List[Dict[str, Any]]:
         """
-        Process all supported documents in a directory
+        Process all supported documents in a directory (backward compatible)
 
         Args:
             directory_path: Path to directory containing documents
@@ -54,6 +83,63 @@ class IngestionPipeline:
 
         return all_chunks
 
+    def process_directory_with_embeddings(self, directory_path: str) -> Dict[str, Any]:
+        """
+        Process all supported documents in a directory with embeddings and enhanced
+        reporting
+
+        Args:
+            directory_path: Path to directory containing documents
+
+        Returns:
+            Dictionary with processing results and statistics
+        """
+        directory = Path(directory_path)
+        if not directory.exists():
+            raise FileNotFoundError(f"Directory not found: {directory_path}")
+
+        all_chunks = []
+        processed_files = 0
+        failed_files = []
+        embeddings_stored = 0
+
+        # Process each supported file
+        for file_path in directory.iterdir():
+            if (
+                file_path.is_file()
+                and file_path.suffix.lower() in self.parser.SUPPORTED_FORMATS
+            ):
+                try:
+                    chunks = self.process_file(str(file_path))
+                    all_chunks.extend(chunks)
+                    processed_files += 1
+                except Exception as e:
+                    print(f"Warning: Failed to process {file_path}: {e}")
+                    failed_files.append({"file": str(file_path), "error": str(e)})
+                    continue
+
+        # Generate and store embeddings if enabled
+        if (
+            self.store_embeddings
+            and all_chunks
+            and self.embedding_service
+            and self.vector_db
+        ):
+            try:
+                embeddings_stored = self._store_embeddings_batch(all_chunks)
+            except Exception as e:
+                print(f"Warning: Failed to store embeddings: {e}")
+
+        return {
+            "status": "success",
+            "chunks_processed": len(all_chunks),
+            "files_processed": processed_files,
+            "failed_files": failed_files,
+            "embeddings_stored": embeddings_stored,
+            "store_embeddings": self.store_embeddings,
+            "chunks": all_chunks,  # Include chunks for backward compatibility
+        }
+
     def process_file(self, file_path: str) -> List[Dict[str, Any]]:
         """
         Process a single file through the complete pipeline
@@ -73,3 +159,51 @@
         )
 
         return chunks
+
+    def _store_embeddings_batch(self, chunks: List[Dict[str, Any]]) -> int:
+        """
+        Generate embeddings and store chunks in vector database
+
+        Args:
+            chunks: List of text chunks with metadata
+
+        Returns:
+            Number of embeddings stored successfully
+        """
+        if not self.embedding_service or not self.vector_db:
+            return 0
+
+        stored_count = 0
+        batch_size = 32  # Process in batches for memory efficiency
+
+        for i in range(0, len(chunks), batch_size):
+            batch = chunks[i : i + batch_size]
+
+            try:
+                # Extract texts and prepare data for vector storage
+                texts = [chunk["content"] for chunk in batch]
+                chunk_ids = [chunk["metadata"]["chunk_id"] for chunk in batch]
+                metadatas = [chunk["metadata"] for chunk in batch]
+
+                # Generate embeddings for the batch
+                embeddings = self.embedding_service.embed_texts(texts)
+
+                # Store in vector database
+                self.vector_db.add_embeddings(
+                    embeddings=embeddings,
+                    chunk_ids=chunk_ids,
+                    documents=texts,
+                    metadatas=metadatas,
+                )
+
+                stored_count += len(batch)
+                print(
+                    f"Stored embeddings for batch {i // batch_size + 1}: "
+                    f"{len(batch)} chunks"
+                )
+
+            except Exception as e:
+                print(f"Warning: Failed to store batch {i // batch_size + 1}: {e}")
+                continue
+
+        return stored_count
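
A minimal sketch of driving the enhanced pipeline directly from Python, using the constructor arguments and report keys added above. The corpus path "data/corpus" is an assumption; any directory of supported .md/.txt files works.

    # Minimal usage sketch; "data/corpus" is an assumed corpus location
    from src.ingestion.ingestion_pipeline import IngestionPipeline

    pipeline = IngestionPipeline(chunk_size=1000, overlap=200, seed=42, store_embeddings=True)
    report = pipeline.process_directory_with_embeddings("data/corpus")
    print(report["files_processed"], report["chunks_processed"], report["embeddings_stored"])
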
tests/test_app.py CHANGED
@@ -1,3 +1,5 @@
+import json
+
 import pytest
 
 from app import app as flask_app
@@ -38,3 +40,123 @@ def test_ingest_endpoint_exists():
     response = client.post("/ingest")
     # Should not be 404 (not found)
     assert response.status_code != 404
+
+
+class TestSearchEndpoint:
+    """Test cases for the /search endpoint"""
+
+    def test_search_endpoint_valid_request(self, client):
+        """Test search endpoint with valid request"""
+        request_data = {"query": "remote work policy", "top_k": 3, "threshold": 0.3}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 200
+        data = response.get_json()
+
+        assert data["status"] == "success"
+        assert data["query"] == "remote work policy"
+        assert "results_count" in data
+        assert "results" in data
+        assert isinstance(data["results"], list)
+
+    def test_search_endpoint_minimal_request(self, client):
+        """Test search endpoint with minimal request (only query)"""
+        request_data = {"query": "employee benefits"}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 200
+        data = response.get_json()
+
+        assert data["status"] == "success"
+        assert data["query"] == "employee benefits"
+
+    def test_search_endpoint_missing_query(self, client):
+        """Test search endpoint with missing query parameter"""
+        request_data = {"top_k": 5}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 400
+        data = response.get_json()
+
+        assert data["status"] == "error"
+        assert "Query parameter is required" in data["message"]
+
+    def test_search_endpoint_empty_query(self, client):
+        """Test search endpoint with empty query"""
+        request_data = {"query": ""}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 400
+        data = response.get_json()
+
+        assert data["status"] == "error"
+        assert "non-empty string" in data["message"]
+
+    def test_search_endpoint_invalid_top_k(self, client):
+        """Test search endpoint with invalid top_k parameter"""
+        request_data = {"query": "test query", "top_k": -1}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 400
+        data = response.get_json()
+
+        assert data["status"] == "error"
+        assert "positive integer" in data["message"]
+
+    def test_search_endpoint_invalid_threshold(self, client):
+        """Test search endpoint with invalid threshold parameter"""
+        request_data = {"query": "test query", "threshold": 1.5}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 400
+        data = response.get_json()
+
+        assert data["status"] == "error"
+        assert "between 0 and 1" in data["message"]
+
+    def test_search_endpoint_non_json_request(self, client):
+        """Test search endpoint with non-JSON request"""
+        response = client.post("/search", data="not json", content_type="text/plain")
+
+        assert response.status_code == 400
+        data = response.get_json()
+
+        assert data["status"] == "error"
+        assert "application/json" in data["message"]
+
+    def test_search_endpoint_result_structure(self, client):
+        """Test that search results have the correct structure"""
+        request_data = {"query": "policy"}
+
+        response = client.post(
+            "/search", data=json.dumps(request_data), content_type="application/json"
+        )
+
+        assert response.status_code == 200
+        data = response.get_json()
+
+        if data["results_count"] > 0:
+            result = data["results"][0]
+            assert "chunk_id" in result
+            assert "content" in result
+            assert "similarity_score" in result
+            assert "metadata" in result
+            assert isinstance(result["similarity_score"], (int, float))
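
The TestSearchEndpoint cases above rely on a `client` fixture that is not part of this diff. A minimal sketch of what such a fixture typically looks like follows; this is an assumption, since the existing fixture definition (likely in the current test module or a conftest.py) is not shown here.

    # Assumed fixture shape; the real fixture lives in the existing tests, not in this PR
    import pytest

    from app import app as flask_app


    @pytest.fixture
    def client():
        """Yield a Flask test client with TESTING enabled."""
        flask_app.config["TESTING"] = True
        with flask_app.test_client() as test_client:
            yield test_client
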
tests/test_enhanced_app.py ADDED
@@ -0,0 +1,103 @@
+"""
+Test for enhanced ingestion Flask endpoint
+"""
+
+import json
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import patch
+
+from app import app
+
+
+class TestEnhancedIngestionEndpoint(unittest.TestCase):
+    """Test cases for enhanced ingestion Flask endpoint"""
+
+    def setUp(self):
+        """Set up test fixtures"""
+        app.config["TESTING"] = True
+        self.app = app.test_client()
+
+        # Create temporary directory and files for testing
+        self.temp_dir = tempfile.mkdtemp()
+        self.test_dir = Path(self.temp_dir)
+
+        self.test_file = self.test_dir / "test.md"
+        self.test_file.write_text(
+            "# Test Document\n\nThis is test content for enhanced ingestion."
+        )
+
+    def test_ingest_endpoint_with_embeddings_default(self):
+        """Test ingestion endpoint with default embeddings enabled"""
+        with patch("src.config.CORPUS_DIRECTORY", str(self.test_dir)):
+            response = self.app.post("/ingest")
+
+            self.assertEqual(response.status_code, 200)
+            data = json.loads(response.data)
+
+            # Check enhanced response structure
+            self.assertEqual(data["status"], "success")
+            self.assertIn("chunks_processed", data)
+            self.assertIn("files_processed", data)
+            self.assertIn("embeddings_stored", data)
+            self.assertIn("store_embeddings", data)
+            self.assertTrue(data["store_embeddings"])  # Default is True
+            self.assertGreater(data["chunks_processed"], 0)
+            self.assertGreater(data["files_processed"], 0)
+
+    def test_ingest_endpoint_with_embeddings_disabled(self):
+        """Test ingestion endpoint with embeddings disabled"""
+        with patch("src.config.CORPUS_DIRECTORY", str(self.test_dir)):
+            response = self.app.post(
+                "/ingest",
+                data=json.dumps({"store_embeddings": False}),
+                content_type="application/json",
+            )
+
+            self.assertEqual(response.status_code, 200)
+            data = json.loads(response.data)
+
+            # Check response structure with embeddings disabled
+            self.assertEqual(data["status"], "success")
+            self.assertIn("chunks_processed", data)
+            self.assertIn("files_processed", data)
+            self.assertIn("embeddings_stored", data)
+            self.assertIn("store_embeddings", data)
+            self.assertFalse(data["store_embeddings"])
+            self.assertEqual(data["embeddings_stored"], 0)
+            self.assertGreater(data["chunks_processed"], 0)
+            self.assertGreater(data["files_processed"], 0)
+
+    def test_ingest_endpoint_with_no_json(self):
+        """Test ingestion endpoint with no JSON payload (should default to
+        embeddings enabled)"""
+        with patch("src.config.CORPUS_DIRECTORY", str(self.test_dir)):
+            response = self.app.post("/ingest")
+
+            self.assertEqual(response.status_code, 200)
+            data = json.loads(response.data)
+
+            # Should default to embeddings enabled
+            self.assertTrue(data["store_embeddings"])
+
+    def test_ingest_endpoint_error_handling(self):
+        """Test ingestion endpoint error handling"""
+        with patch("src.config.CORPUS_DIRECTORY", "/nonexistent/directory"):
+            response = self.app.post("/ingest")
+
+            self.assertEqual(response.status_code, 500)
+            data = json.loads(response.data)
+
+            self.assertEqual(data["status"], "error")
+            self.assertIn("message", data)
+
+    def tearDown(self):
+        """Clean up test fixtures"""
+        import shutil
+
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+
+
+if __name__ == "__main__":
+    unittest.main()
tests/test_ingestion/test_enhanced_ingestion_pipeline.py ADDED
@@ -0,0 +1,231 @@
+"""
+Tests for enhanced ingestion pipeline with embeddings
+"""
+
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import Mock, patch
+
+from src.ingestion.ingestion_pipeline import IngestionPipeline
+
+
+class TestEnhancedIngestionPipeline(unittest.TestCase):
+    """Test cases for enhanced IngestionPipeline with embeddings"""
+
+    def setUp(self):
+        """Set up test fixtures"""
+        self.temp_dir = tempfile.mkdtemp()
+        self.test_dir = Path(self.temp_dir)
+
+        # Create test files
+        self.test_file1 = self.test_dir / "test1.md"
+        self.test_file1.write_text(
+            "# Test Document 1\n\nThis is test content for document 1."
+        )
+
+        self.test_file2 = self.test_dir / "test2.txt"
+        self.test_file2.write_text("This is test content for document 2.")
+
+        # Create an unsupported file (should be skipped)
+        self.test_file3 = self.test_dir / "test3.pdf"
+        self.test_file3.write_text("PDF content")
+
+    def test_initialization_without_embeddings(self):
+        """Test pipeline initialization without embeddings"""
+        pipeline = IngestionPipeline(store_embeddings=False)
+
+        self.assertIsNotNone(pipeline.parser)
+        self.assertIsNotNone(pipeline.chunker)
+        self.assertFalse(pipeline.store_embeddings)
+        self.assertIsNone(pipeline.embedding_service)
+        self.assertIsNone(pipeline.vector_db)
+
+    def test_initialization_with_embeddings(self):
+        """Test pipeline initialization with embeddings"""
+        pipeline = IngestionPipeline(store_embeddings=True)
+
+        self.assertIsNotNone(pipeline.parser)
+        self.assertIsNotNone(pipeline.chunker)
+        self.assertTrue(pipeline.store_embeddings)
+        self.assertIsNotNone(pipeline.embedding_service)
+        self.assertIsNotNone(pipeline.vector_db)
+
+    def test_initialization_with_custom_components(self):
+        """Test pipeline initialization with custom embedding components"""
+        mock_embedding_service = Mock()
+        mock_vector_db = Mock()
+
+        pipeline = IngestionPipeline(
+            store_embeddings=True,
+            embedding_service=mock_embedding_service,
+            vector_db=mock_vector_db,
+        )
+
+        self.assertEqual(pipeline.embedding_service, mock_embedding_service)
+        self.assertEqual(pipeline.vector_db, mock_vector_db)
+
+    def test_process_directory_without_embeddings(self):
+        """Test directory processing without embeddings"""
+        pipeline = IngestionPipeline(store_embeddings=False)
+        result = pipeline.process_directory_with_embeddings(str(self.test_dir))
+
+        # Check response structure
+        self.assertIsInstance(result, dict)
+        self.assertEqual(result["status"], "success")
+        self.assertGreater(result["chunks_processed"], 0)
+        self.assertEqual(result["files_processed"], 2)  # Only .md and .txt files
+        self.assertEqual(result["embeddings_stored"], 0)
+        self.assertFalse(result["store_embeddings"])
+        self.assertIn("chunks", result)
+
+    @patch("src.ingestion.ingestion_pipeline.VectorDatabase")
+    @patch("src.ingestion.ingestion_pipeline.EmbeddingService")
+    def test_process_directory_with_embeddings(
+        self, mock_embedding_service_class, mock_vector_db_class
+    ):
+        """Test directory processing with embeddings"""
+        # Mock the classes to return mock instances
+        mock_embedding_service = Mock()
+        mock_vector_db = Mock()
+        mock_embedding_service_class.return_value = mock_embedding_service
+        mock_vector_db_class.return_value = mock_vector_db
+
+        # Configure mock embedding service
+        mock_embedding_service.embed_texts.return_value = [
+            [0.1, 0.2, 0.3],
+            [0.4, 0.5, 0.6],
+        ]
+
+        # Configure mock vector database
+        mock_vector_db.add_embeddings.return_value = True
+
+        pipeline = IngestionPipeline(store_embeddings=True)
+        result = pipeline.process_directory_with_embeddings(str(self.test_dir))
+
+        # Check response structure
+        self.assertIsInstance(result, dict)
+        self.assertEqual(result["status"], "success")
+        self.assertGreater(result["chunks_processed"], 0)
+        self.assertEqual(result["files_processed"], 2)
+        self.assertGreater(result["embeddings_stored"], 0)
+        self.assertTrue(result["store_embeddings"])
+
+        # Verify embedding service was called
+        mock_embedding_service.embed_texts.assert_called()
+        mock_vector_db.add_embeddings.assert_called()
+
+    def test_process_directory_nonexistent(self):
+        """Test processing non-existent directory"""
+        pipeline = IngestionPipeline(store_embeddings=False)
+
+        with self.assertRaises(FileNotFoundError):
+            pipeline.process_directory("/nonexistent/directory")
+
+    def test_store_embeddings_batch_without_components(self):
+        """Test batch embedding storage without embedding components"""
+        pipeline = IngestionPipeline(store_embeddings=False)
+
+        chunks = [
+            {
+                "content": "Test content 1",
+                "metadata": {"chunk_id": "test1", "source": "test1.txt"},
+            }
+        ]
+
+        result = pipeline._store_embeddings_batch(chunks)
+        self.assertEqual(result, 0)
+
+    @patch("src.ingestion.ingestion_pipeline.VectorDatabase")
+    @patch("src.ingestion.ingestion_pipeline.EmbeddingService")
+    def test_store_embeddings_batch_success(
+        self, mock_embedding_service_class, mock_vector_db_class
+    ):
+        """Test successful batch embedding storage"""
+        # Mock the classes to return mock instances
+        mock_embedding_service = Mock()
+        mock_vector_db = Mock()
+        mock_embedding_service_class.return_value = mock_embedding_service
+        mock_vector_db_class.return_value = mock_vector_db
+
+        # Configure mocks
+        mock_embedding_service.embed_texts.return_value = [
+            [0.1, 0.2, 0.3],
+            [0.4, 0.5, 0.6],
+        ]
+        mock_vector_db.add_embeddings.return_value = True
+
+        pipeline = IngestionPipeline(store_embeddings=True)
+
+        chunks = [
+            {
+                "content": "Test content 1",
+                "metadata": {"chunk_id": "test1", "source": "test1.txt"},
+            },
+            {
+                "content": "Test content 2",
+                "metadata": {"chunk_id": "test2", "source": "test2.txt"},
+            },
+        ]
+
+        result = pipeline._store_embeddings_batch(chunks)
+        self.assertEqual(result, 2)
+
+        # Verify method calls
+        mock_embedding_service.embed_texts.assert_called_once_with(
+            ["Test content 1", "Test content 2"]
+        )
+        mock_vector_db.add_embeddings.assert_called_once()
+
+    @patch("src.ingestion.ingestion_pipeline.VectorDatabase")
+    @patch("src.ingestion.ingestion_pipeline.EmbeddingService")
+    def test_store_embeddings_batch_error_handling(
+        self, mock_embedding_service_class, mock_vector_db_class
+    ):
+        """Test error handling in batch embedding storage"""
+        # Mock the classes to return mock instances
+        mock_embedding_service = Mock()
+        mock_vector_db = Mock()
+        mock_embedding_service_class.return_value = mock_embedding_service
+        mock_vector_db_class.return_value = mock_vector_db
+
+        # Configure embedding service to raise an error
+        mock_embedding_service.embed_texts.side_effect = Exception("Embedding error")
+
+        pipeline = IngestionPipeline(store_embeddings=True)
+
+        chunks = [
+            {
+                "content": "Test content 1",
+                "metadata": {"chunk_id": "test1", "source": "test1.txt"},
+            }
+        ]
+
+        # Should handle error gracefully and return 0
+        result = pipeline._store_embeddings_batch(chunks)
+        self.assertEqual(result, 0)
+
+    def test_backward_compatibility(self):
+        """Test that enhanced pipeline maintains backward compatibility"""
+        pipeline = IngestionPipeline(store_embeddings=False)
+        result = pipeline.process_directory(str(self.test_dir))
+
+        # Should return list for backward compatibility
+        self.assertIsInstance(result, list)
+        self.assertGreater(len(result), 0)
+
+        # First chunk should have expected structure
+        chunk = result[0]
+        self.assertIn("content", chunk)
+        self.assertIn("metadata", chunk)
+        self.assertIn("chunk_id", chunk["metadata"])
+
+    def tearDown(self):
+        """Clean up test fixtures"""
+        import shutil
+
+        shutil.rmtree(self.temp_dir, ignore_errors=True)
+
+
+if __name__ == "__main__":
+    unittest.main()