---
title: Regulated Domain RAG Evaluation
emoji: πŸ€–
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8501
pinned: false
---

*(The YAML front matter above is the Hugging Face Spaces configuration.)*

# Retrieval-Augmented Generation Evaluation Framework

*(Legal & Financial domains, with full regulatory-grade metrics and dashboard)*

> **Project context** – Implementation of the research proposal
> **β€œToward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains.”**
> Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.

---

## 1 Quick start

```bash
git clone https://github.com/Romainkul/rag_evaluation.git
cd rag_evaluation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pre-commit install

bash scripts/download_data.sh
python scripts/analysis.py \
  --config configs/kilt_hybrid_ce.yaml \
  --queries data/sample_queries.jsonl
```

The first call embeds documents, builds a **FAISS** dense index and a **Pyserini** sparse index; subsequent runs reuse them.

---

## 2 Repository layout

```
evaluation/              ← Core library
β”œβ”€ config.py             β€’ Typed dataclasses (retriever, generator, stats, reranker, logging)
β”œβ”€ pipeline.py           β€’ Retrieval β†’ (optional) re-rank β†’ generation
β”œβ”€ retrievers/           β€’ BM25, Dense (Sentence-Transformers + FAISS), Hybrid
β”œβ”€ rerankers/            β€’ Cross-encoder re-ranker
β”œβ”€ generators/           β€’ Hugging Face seq2seq wrapper
β”œβ”€ metrics/              β€’ Retrieval, generation, composite RAG score
└─ stats/                β€’ Correlation, significance, robustness utilities

scripts/                 ← CLI tools
β”œβ”€ prep_annotations.py   β€’ Runs the RAG pipeline and logs all outputs for expert annotation
β”œβ”€ analysis.py           β€’ **Grid runner** – all configs Γ— datasets, RQ1-RQ4 analysis
└─ dashboard.py          β€’ **Streamlit dashboard** for interactive exploration

tests/                   ← PyTest tests
configs/                 ← YAML templates for pipelines & stats
.github/workflows/       ← Lint + tests CI
Dockerfile               ← Slim reproducible image
```

---

## 3 Mapping code ↔ proposal tasks

| Research-proposal element | Code artefact | Purpose |
| --- | --- | --- |
| **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P\@k vs *human\_correct*. |
| **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F etc. with *human\_faithful*; Wilcoxon + Holm. |
| **RQ3** Error propagation β†’ hallucination | `evaluation.stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles (sketched below). |
| **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Ξ”-metrics & Cohen’s *d* between clean and perturbed runs (sketched below). |
| Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
| EU AI-Act traceability (Art. 14-15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
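
For illustration, the RQ3 check boils down to a 2 Γ— 2 contingency test: does a retrieval failure make a hallucinated answer more likely? The sketch below is **not** the `evaluation/stats` implementation; it calls SciPy directly on toy per-query flags (the arrays and flag names are made up) just to show the shape of the computation.

```python
# Illustrative only: the per-query flags below are toy data, not the
# framework's actual results.jsonl schema.
import numpy as np
from scipy.stats import chi2_contingency

retrieval_failed = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])  # 1 = gold doc not retrieved
hallucinated     = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 0])  # 1 = answer judged unfaithful

# 2x2 contingency table: rows = retrieval ok / failed, cols = faithful / hallucinated
table = np.array([
    [np.sum((retrieval_failed == 0) & (hallucinated == 0)),
     np.sum((retrieval_failed == 0) & (hallucinated == 1))],
    [np.sum((retrieval_failed == 1) & (hallucinated == 0)),
     np.sum((retrieval_failed == 1) & (hallucinated == 1))],
])

chi2, p, dof, _ = chi2_contingency(table)
cond = table[:, 1] / table.sum(axis=1)  # conditional hallucination rates
print(f"χ²={chi2:.2f} (dof={dof}), p={p:.3g}, "
      f"P(halluc. | retrieval ok)={cond[0]:.2f}, "
      f"P(halluc. | retrieval failed)={cond[1]:.2f}")
```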
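
Likewise, the RQ4 comparison reduces to a Ξ”-metric plus an effect size between a clean run and its perturbed counterpart. Here is a minimal NumPy-only sketch, assuming two `results.jsonl` files written by the grid runner (the per-query `metrics`/`mrr` keys follow the Β§ 7 example; the `legal_pert` output directory and the `load_metric` / `cohens_d` helpers are illustrative, not part of the library):

```python
# Paths and the "legal_pert" directory name are assumptions; point them at
# wherever your clean and perturbed grid runs were written.
import json
import numpy as np

def load_metric(path: str, metric: str) -> np.ndarray:
    """Collect one per-query metric column from a results.jsonl file."""
    with open(path) as fh:
        return np.array([json.loads(line)["metrics"][metric] for line in fh])

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float((a.mean() - b.mean()) / pooled)

clean = load_metric("outputs/grid/legal/hybrid/results.jsonl", "mrr")
pert  = load_metric("outputs/grid/legal_pert/hybrid/results.jsonl", "mrr")

print(f"Ξ”-MRR = {pert.mean() - clean.mean():+.3f}, Cohen's d = {cohens_d(pert, clean):+.2f}")
```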
---

## 4 Running a grid of experiments

```bash
# Evaluate three configs on two datasets, save everything under outputs/grid
python scripts/analysis.py \
  --configs configs/*.yaml \
  --datasets data/legal.jsonl data/finance.jsonl \
  --plots
```

*Per dataset* the script writes:

```
outputs/grid/<dataset>/<config>/
  results.jsonl          ← per-query outputs + metrics
  aggregates.yaml        ← mean metrics
  rq1.yaml … rq4.yaml    ← answers to each research question
  mrr_vs_correct.png     ← diagnostic scatter
outputs/grid/<dataset>/wilcoxon_rag_holm.yaml  ← pairwise p-values
```

### Incremental mode

Run a *single* new config and automatically compare it to all previous ones:

```bash
python scripts/analysis.py \
  --configs configs/my_new.yaml \
  --datasets data/legal.jsonl \
  --outdir outputs/grid \
  --plots
```

---

## 5 Interactive dashboard

```bash
streamlit run scripts/dashboard.py
```

The UI lets you

1. pick a dataset
2. select any subset of configs
3. view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
4. download raw `results.jsonl` for external analysis

---

## 6 Index generation details

* **Sparse (BM25 / Lucene)** – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini’s CLI to build it from the `doc_store` JSONL (`{"id", "text"}`).
* **Dense (FAISS)** – `DenseRetriever` embeds documents with the Sentence-Transformers model named in the config, L2-normalises the vectors, and writes an inner-product (IP) FAISS index.

Both artefacts are cached, so the heavy work only happens once.

---

## 7 Example: manual statistical scripting

```python
import json

from evaluation import StatsConfig
from evaluation.stats import corr_ci

rows = [json.loads(line) for line in open("outputs/grid/legal/hybrid/results.jsonl")]
cfg = StatsConfig(n_boot=5000)

mrr = [r["metrics"]["mrr"] for r in rows]
gold = [1 if r["human_correct"] else 0 for r in rows]

r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
print(f"Spearman ρ={r:.2f} 95%CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
```

All statistical helpers rely only on **NumPy & SciPy**, so they run in the minimal Docker image.

---

### Happy evaluating & dashboarding!

Questions or suggestions? Open an issue or start a discussion.