---
title: RAG Pipeline Demo
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.25.0"
app_file: app.py
pinned: false
---

# 🎓 University Knowledge Retrieval System

A comprehensive Retrieval-Augmented Generation (RAG) system for university data with advanced guardrails and experimental validation.

## 🌟 Features

- **Interactive Chat Interface**: Natural language queries about university data
- **Advanced Guardrails**: Enhanced input/output security, including blocking queries with Austrian social security numbers (SVNRs) and redacting them from responses.
- **Experimental Dashboard**: Comprehensive testing suite for RAG validation.
- **Real Database**: 6,000+ students, about 1,300 faculty, and 2,600 courses; student records include SVNRs for guardrail testing.
- **Vector Search**: Semantic search using ChromaDB and Sentence Transformers, with SVNRs intentionally included in student documents for output guardrail testing.

## 🚀 Quick Start

### Prerequisites
- Python 3.8+
- Hugging Face API Token

### Installation

1. **Clone and setup**:
```bash
git clone <repository-url>
cd projekt
pip install -r requirements.txt
```

2. **Configure API Key** (choose one):
   - Create `secrets_local.py`: `HF = "your_hugging_face_token"`
   - Or set environment variable: `HF_TOKEN=your_token`

3. **Initialize Database and Vector Store**:
   Run the setup scripts to create the SQLite database and populate the vector store.
```bash
python database/setup_db.py
python rag/build_vector_store.py
```

4. **Run the application**:
```bash
streamlit run app.py
```
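
Step 2 names two places where the token can live. A minimal sketch of how an application could resolve it, assuming `secrets_local.py` is tried first and `HF_TOKEN` is the fallback. The lookup order and the helper name `load_hf_token` are assumptions for illustration, not taken from `app.py`:

```python
import importlib
import os


def load_hf_token(env=None, module_name="secrets_local"):
    """Return the Hugging Face token from a local secrets module or the environment.

    Hypothetical helper: the lookup order (module first, then HF_TOKEN) is an assumption.
    """
    env = os.environ if env is None else env
    try:
        # secrets_local.py is expected to define: HF = "your_hugging_face_token"
        token = getattr(importlib.import_module(module_name), "HF", None)
    except ImportError:
        token = None
    token = token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF in secrets_local.py or export HF_TOKEN")
    return token
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the first model call.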

## 📊 System Components

### Chat Interface
- Natural language queries about university data
- Real-time RAG pipeline with source citations
- Input/output guardrails for security
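
The flow behind the chat interface can be sketched as four stages: input guardrail, retrieval, generation, output guardrail. The function below is only an illustration of that ordering; `answer_query` and its callable parameters are hypothetical names, and the real pipeline in `rag/`, `rails/`, and `model/` may be wired differently:

```python
from typing import Callable, List, Tuple


def answer_query(
    query: str,
    check_input: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    filter_output: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Run one guarded RAG turn and return (answer, sources)."""
    if not check_input(query):
        # Input guardrail: refuse before anything reaches retrieval or the model.
        return "Your request was blocked by the input guardrail.", []
    sources = retrieve(query)         # documents used as context and as citations
    draft = generate(query, sources)  # model answer grounded in the sources
    # Output guardrail: sanitize the draft before it is shown to the user.
    return filter_output(draft), sources
```

Passing the stages in as callables keeps the sketch independent of any particular vector store or model client, and makes each stage easy to replace with a stub in tests.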

### Experimental Dashboard
Two comprehensive test suites:
1. **Input Guardrails**: Tests against malicious inputs (SQL injection, PII extraction)
2. **Output Guardrails**: Validates response quality and detects hallucinations
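
An input guardrail suite boils down to running labeled prompts through a guard and counting mismatches. The harness below is a minimal sketch under that assumption; `run_input_guardrail_suite` and the case format are hypothetical and not the code in `experiments/`:

```python
from typing import Callable, Dict, List, Tuple


def run_input_guardrail_suite(
    guard: Callable[[str], bool],
    cases: List[Tuple[str, bool]],
) -> Dict[str, object]:
    """Score a guard against labeled cases.

    guard(text) returns True when the text is allowed.
    Each case is (text, should_block).
    """
    # A case fails when the guard allows something that should be blocked,
    # or blocks something that should be allowed.
    failures = [text for text, should_block in cases if guard(text) == should_block]
    total = len(cases)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }
```

Returning the failing prompts alongside the score makes it easier to see which attack category a guard misses.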

## 🏗️ Architecture

```
├── app.py                    # Main Streamlit application
├── experimental_dashboard.py # Experiment interface and system info
├── experiments/              # Test suites for RAG validation
│   ├── experiment_1_input_guardrails.py
│   ├── experiment_2_output_guardrails.py
│   ├── experiment_3_hyperparameters.py
│   └── experiment_4_context_window.py
├── database/                 # SQLite university database
├── rag/                      # Vector store and retrieval
├── rails/                    # Input/output guardrails
├── model/                    # RAG model integration
└── guards/                   # Security components
```

## 🔧 Configuration

**Dependencies** (requirements.txt):
- streamlit==1.37.0
- sentence-transformers==5.1.0
- chromadb==1.0.21
- Faker==15.3.4 (for database generation)
- huggingface-hub==0.34.4
- nltk, numpy, scikit-learn

## 🎯 Usage Examples

**Student Queries**:
- "What courses is Maria taking?"
- "Who are the students in computer science?"

**Faculty Queries**:
- "Who teaches in the engineering department?"
- "Show me all professors"

**Course Queries**:
- "What courses are available?"
- "Who teaches advanced mathematics?"

## 🧪 Running Experiments

Access them via the "Experiments" tab in the web interface, or run them individually:

```bash
cd experiments
python experiment_1_input_guardrails.py
python experiment_2_output_guardrails.py
```

## 🔒 Security Features

- **Input Validation**: SQL injection prevention, malicious prompt detection, **blocking queries containing valid Austrian social security numbers (SVNRs)**.
- **Output Filtering**: PII redaction (including **SVNRs**), hallucination detection, relevance checking.
- **Content Sanitization**: Automatic cleaning of responses and database content.
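
To illustrate what "valid SVNR" can mean, the sketch below checks the ten-digit format and the check digit, then uses that check for both blocking and redaction. It assumes the commonly documented scheme (fourth digit is the check digit, weights 3, 7, 9, 5, 8, 4, 2, 1, 6 over the other digits, sum modulo 11). The function names are hypothetical and this is not the implementation in `rails/`:

```python
import re

# Weights for the ten digits; position 4 (the check digit itself) gets weight 0.
_WEIGHTS = (3, 7, 9, 0, 5, 8, 4, 2, 1, 6)
# Ten digits, optionally written as "NNNN DDMMYY".
_SVNR_PATTERN = re.compile(r"\b\d{4}\s?\d{6}\b")


def is_valid_svnr(candidate: str) -> bool:
    """Return True if the candidate has ten digits and a matching check digit."""
    digits = re.sub(r"\s", "", candidate)
    if not re.fullmatch(r"\d{10}", digits):
        return False
    checksum = sum(int(d) * w for d, w in zip(digits, _WEIGHTS)) % 11
    # A remainder of 10 can never equal a single digit, so it is rejected here.
    return checksum == int(digits[3])


def contains_svnr(text: str) -> bool:
    """Input guardrail check: does the text contain a valid SVNR?"""
    return any(is_valid_svnr(m.group(0)) for m in _SVNR_PATTERN.finditer(text))


def redact_svnrs(text: str, placeholder: str = "[REDACTED SVNR]") -> str:
    """Output guardrail step: replace valid SVNRs, leave other numbers alone."""
    return _SVNR_PATTERN.sub(
        lambda m: placeholder if is_valid_svnr(m.group(0)) else m.group(0), text
    )
```

Validating the check digit before blocking or redacting reduces false positives on other ten-digit numbers such as phone numbers, at the cost of missing mistyped SVNRs.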

## 📈 Database Statistics

- **Students**: 6,398 records with realistic personal data
- **Faculty**: 1,297 professors across multiple departments
- **Courses**: 2,600 courses linked to faculty
- **Enrollments**: 19,443 student-course relationships
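
The relationships behind these counts can be pictured with a small SQLite sketch. The table and column names below are assumptions made for illustration; the actual schema created by `database/setup_db.py` may differ:

```python
import sqlite3

# Hypothetical schema: courses link to faculty, enrollments link students to courses.
SCHEMA = """
CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT NOT NULL, department TEXT);
CREATE TABLE courses (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    faculty_id INTEGER REFERENCES faculty(id)
);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(id),
    course_id INTEGER REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)
);
"""
TABLES = ("students", "faculty", "courses", "enrollments")


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four illustrative tables."""
    conn.executescript(SCHEMA)


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return the row count per table, as reported in a statistics section."""
    # Table names come from the fixed TABLES tuple, never from user input.
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}
```

Against an in-memory database, `create_schema(sqlite3.connect(":memory:"))` followed by `table_counts(...)` returns zero for every table until rows are inserted.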

## 🔑 API Requirements

Requires Hugging Face API access for:
- Text generation models
- Embedding models for semantic search
- Guardrail validation services