metadata
title: RAG Pipeline Demo
emoji: π€
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.25.0
app_file: app.py
pinned: false
π University Knowledge Retrieval System
A comprehensive Retrieval-Augmented Generation (RAG) system for university data with advanced guardrails and experimental validation.
π Features
- Interactive Chat Interface: Natural language queries about university data
- Advanced Guardrails: Enhanced input/output security, including blocking queries with Austrian social security numbers (SVNRs) and redacting them from responses.
- Experimental Dashboard: Comprehensive testing suite for RAG validation.
- Real Database: 6,000+ students, 1,300+ faculty, 2,600+ courses, with student records now including SVNRs for guardrail testing.
- Vector Search: Semantic search using ChromaDB and Sentence Transformers, with SVNRs intentionally included in student documents for output guardrail testing.
π Quick Start
Prerequisites
- Python 3.8+
- Hugging Face API Token
Installation
- Clone and setup:
git clone <repository-url>
cd projekt
pip install -r requirements.txt
Configure API Key (choose one):
- Create
secrets_local.py:HF = "your_hugging_face_token" - Or set environment variable:
HF_TOKEN=your_token
- Create
Initialize Database and Vector Store: Run the setup scripts to create the SQLite database and populate the vector store.
python database/setup_db.py
python rag/build_vector_store.py
- Run the application:
streamlit run app.py
π System Components
Chat Interface
- Natural language queries about university data
- Real-time RAG pipeline with source citations
- Input/output guardrails for security
Experimental Dashboard
Two comprehensive test suites:
- Input Guardrails: Tests against malicious inputs (SQL injection, PII extraction)
- Output Guardrails: Validates response quality and detects hallucinations
ποΈ Architecture
βββ app.py # Main Streamlit application
βββ experimental_dashboard.py # Experiment interface and system info
βββ experiments/ # Test suites for RAG validation
β βββ experiment_1_input_guardrails.py
β βββ experiment_2_output_guardrails.py
β βββ experiment_3_hyperparameters.py
β βββ experiment_4_context_window.py
βββ database/ # SQLite university database
βββ rag/ # Vector store and retrieval
βββ rails/ # Input/output guardrails
βββ model/ # RAG model integration
βββ guards/ # Security components
π§ Configuration
Dependencies (requirements.txt):
- streamlit==1.37.0
- sentence-transformers==5.1.0
- chromadb==1.0.21
- Faker==15.3.4 (for database generation)
- huggingface-hub==0.34.4
- nltk, numpy, scikit-learn
π― Usage Examples
Student Queries:
- "What courses is Maria taking?"
- "Who are the students in computer science?"
Faculty Queries:
- "Who teaches in the engineering department?"
- "Show me all professors"
Course Queries:
- "What courses are available?"
- "Who teaches advanced mathematics?"
π§ͺ Running Experiments
Access via the "Experiments" tab in the web interface, or run individually:
cd experiments
python experiment_1_input_guardrails.py
python experiment_2_output_guardrails.py
π Security Features
- Input Validation: SQL injection prevention, malicious prompt detection, blocking queries containing valid Austrian social security numbers (SVNRs).
- Output Filtering: PII redaction (including SVNRs), hallucination detection, relevance checking.
- Content Sanitization: Automatic cleaning of responses and database content.
π Database Statistics
- Students: 6,398 records with realistic personal data
- Faculty: 1,297 professors across multiple departments
- Courses: 2,600 courses linked to faculty
- Enrollments: 19,443 student-course relationships
π API Requirements
Requires Hugging Face API access for:
- Text generation models
- Embedding models for semantic search
- Guardrail validation services