---
title: RAG Pipeline Demo
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.25.0"
app_file: app.py
pinned: false
---
# University Knowledge Retrieval System
A comprehensive Retrieval-Augmented Generation (RAG) system for university data with advanced guardrails and experimental validation.
## Features
- **Interactive Chat Interface**: Natural language queries about university data
- **Advanced Guardrails**: Enhanced input/output security, including blocking queries with Austrian social security numbers (SVNRs) and redacting them from responses.
- **Experimental Dashboard**: Comprehensive testing suite for RAG validation.
- **Real Database**: About 6,400 students, 1,300 faculty, and 2,600 courses, with student records including SVNRs for guardrail testing.
- **Vector Search**: Semantic search using ChromaDB and Sentence Transformers, with SVNRs intentionally included in student documents for output guardrail testing.
## Quick Start
### Prerequisites
- Python 3.8+
- Hugging Face API Token
### Installation
1. **Clone and set up**:
```bash
git clone <repository-url>
cd projekt
pip install -r requirements.txt
```
2. **Configure API Key** (choose one):
- Create `secrets_local.py`: `HF = "your_hugging_face_token"`
- Or set environment variable: `HF_TOKEN=your_token`
3. **Initialize Database and Vector Store**:
Run the setup scripts to create the SQLite database and populate the vector store.
```bash
python database/setup_db.py
python rag/build_vector_store.py
```
4. **Run the application**:
```bash
streamlit run app.py
```
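
Step 2 gives two places to put the token. A minimal sketch of how the app might resolve it, assuming `secrets_local.py` takes precedence over the environment variable; the precedence and the function name `load_hf_token` are assumptions, not taken from the project code:

```python
import os


def load_hf_token():
    """Return the Hugging Face token, or None if neither source is configured."""
    try:
        # Optional local file from step 2; it is absent in most deployments.
        from secrets_local import HF
        return HF
    except ImportError:
        # Fall back to the HF_TOKEN environment variable.
        return os.environ.get("HF_TOKEN")
```

Returning `None` instead of raising lets the caller decide whether to show a setup hint or stop.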
## System Components
### Chat Interface
- Natural language queries about university data
- Real-time RAG pipeline with source citations
- Input/output guardrails for security
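
The retrieve-then-cite flow can be sketched in a few lines. This is a toy illustration, not the project's pipeline: it swaps ChromaDB and Sentence Transformers for a bag-of-words cosine similarity so it runs with the standard library only, and the documents, ids, and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the real app retrieves from its vector store.
DOCS = {
    "course:101": "Advanced Mathematics is taught by Professor Huber.",
    "course:102": "Introduction to Programming is taught by Professor Gruber.",
    "student:7": "Maria is enrolled in Advanced Mathematics.",
}


def vectorize(text):
    """Bag-of-words term counts, a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return the ids of the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(DOCS, key=lambda doc_id: cosine(q, vectorize(DOCS[doc_id])), reverse=True)
    return ranked[:k]


def build_prompt(query, k=2):
    """Assemble a prompt whose context lines carry their source ids."""
    context = "\n".join(f"[{doc_id}] {DOCS[doc_id]}" for doc_id in retrieve(query, k))
    return f"Answer using only the sources below and cite their ids.\n{context}\nQuestion: {query}"
```

Calling `build_prompt("Who teaches advanced mathematics?")` yields a prompt in which every context line starts with its source id, which is what lets the model's answer point back to specific records.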
### Experimental Dashboard
Two guardrail test suites:
1. **Input Guardrails**: Tests against malicious inputs (SQL injection, PII extraction)
2. **Output Guardrails**: Validates response quality and detects hallucinations
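
An input-guardrail suite boils down to labeled queries run against a guard function. The sketch below shows that shape under stated assumptions: `naive_input_guard`, its regex patterns, and the test cases are hypothetical and far simpler than a production guard.

```python
import re

# Hypothetical patterns; the project's real guard is not shown in this README.
SQL_INJECTION = re.compile(
    r"\bunion\s+select\b|\bdrop\s+table\b|'\s*or\s+'?1'?\s*=\s*'?1|;\s*--",
    re.IGNORECASE,
)


def naive_input_guard(query):
    """Return True if the query should be blocked."""
    return bool(SQL_INJECTION.search(query))


# (query, should_block) pairs standing in for an experiment's test set.
CASES = [
    ("What courses is Maria taking?", False),
    ("Who teaches advanced mathematics?", False),
    ("'; DROP TABLE students; --", True),
    ("x' OR '1'='1", True),
    ("1 UNION SELECT name FROM students", True),
]


def run_suite(guard, cases):
    """Count correct decisions and collect the misclassified queries."""
    failures = [query for query, expected in cases if guard(query) != expected]
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}
```

Keeping benign queries in the same case list matters: a guard that blocks everything would otherwise score perfectly.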
## Architecture
```
├── app.py                          # Main Streamlit application
├── experimental_dashboard.py       # Experiment interface and system info
├── experiments/                    # Test suites for RAG validation
│   ├── experiment_1_input_guardrails.py
│   ├── experiment_2_output_guardrails.py
│   ├── experiment_3_hyperparameters.py
│   └── experiment_4_context_window.py
├── database/                       # SQLite university database
├── rag/                            # Vector store and retrieval
├── rails/                          # Input/output guardrails
├── model/                          # RAG model integration
└── guards/                         # Security components
```
## Configuration
**Dependencies** (requirements.txt):
- streamlit==1.37.0
- sentence-transformers==5.1.0
- chromadb==1.0.21
- Faker==15.3.4 (for database generation)
- huggingface-hub==0.34.4
- nltk, numpy, scikit-learn
## Usage Examples
**Student Queries**:
- "What courses is Maria taking?"
- "Who are the students in computer science?"
**Faculty Queries**:
- "Who teaches in the engineering department?"
- "Show me all professors"
**Course Queries**:
- "What courses are available?"
- "Who teaches advanced mathematics?"
## Running Experiments
Access via the "Experiments" tab in the web interface, or run individually:
```bash
cd experiments
python experiment_1_input_guardrails.py
python experiment_2_output_guardrails.py
```
## Security Features
- **Input Validation**: SQL injection prevention, malicious prompt detection, **blocking queries containing valid Austrian social security numbers (SVNRs)**.
- **Output Filtering**: PII redaction (including **SVNRs**), hallucination detection, relevance checking.
- **Content Sanitization**: Automatic cleaning of responses and database content.
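
To make "valid SVNR" concrete: an Austrian social security number has 10 digits, and the fourth is a check digit computed mod 11 from the other nine. The sketch below is one way such a guardrail could detect and redact them; the function names, the regex, and the placeholder text are assumptions, and the birth-date part is deliberately not validated.

```python
import re

# Weights for the 10 digit positions; the check digit (4th position) has weight 0.
WEIGHTS = (3, 7, 9, 0, 5, 8, 4, 2, 1, 6)
# Four digits, an optional space, then six digits (assumed input formats).
SVNR_PATTERN = re.compile(r"\b(\d{4}) ?(\d{6})\b")


def is_valid_svnr(candidate):
    """Check length, a non-zero leading digit, and the mod-11 check digit."""
    digits = re.sub(r"\s", "", candidate)
    if not re.fullmatch(r"[1-9]\d{9}", digits):
        return False
    total = sum(weight * int(digit) for weight, digit in zip(WEIGHTS, digits))
    # A remainder of 10 can never equal a single digit, so it is rejected.
    return total % 11 == int(digits[3])


def contains_svnr(text):
    """Input side: True if the text holds at least one valid SVNR."""
    return any(is_valid_svnr(a + b) for a, b in SVNR_PATTERN.findall(text))


def redact_svnrs(text, placeholder="[SVNR REDACTED]"):
    """Output side: replace valid SVNRs and leave other 10-digit numbers alone."""
    def replace(match):
        number = match.group(1) + match.group(2)
        return placeholder if is_valid_svnr(number) else match.group(0)

    return SVNR_PATTERN.sub(replace, text)
```

Checking the check digit instead of matching any 10-digit number keeps phone numbers and ids from being blocked or redacted by mistake.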
## Database Statistics
- **Students**: 6,398 records with realistic personal data
- **Faculty**: 1,297 professors across multiple departments
- **Courses**: 2,600 courses linked to faculty
- **Enrollments**: 19,443 student-course relationships
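
Counts like these can be checked with plain `sqlite3`. The sketch below uses an in-memory database with a hypothetical schema; the table and column names are illustrative guesses, not the project's actual layout.

```python
import sqlite3

# Hypothetical schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT, department TEXT);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, svnr TEXT);
CREATE TABLE courses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    faculty_id INTEGER REFERENCES faculty(id)
);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(id),
    course_id INTEGER REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)
);
"""


def table_counts(conn):
    """Return row counts for the four tables named in the statistics."""
    tables = ("students", "faculty", "courses", "enrollments")
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO faculty VALUES (1, 'Example Professor', 'Mathematics')")
conn.execute("INSERT INTO students VALUES (1, 'Example Student', NULL)")
conn.execute("INSERT INTO courses VALUES (1, 'Advanced Mathematics', 1)")
conn.execute("INSERT INTO enrollments VALUES (1, 1)")
print(table_counts(conn))
```

Modeling enrollments as a join table with a composite primary key is what makes them "student-course relationships": each pair can appear only once.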
## API Requirements
Requires Hugging Face API access for:
- Text generation models
- Embedding models for semantic search
- Guardrail validation services