---
title: Regulated Domain RAG Evaluation
emoji: 🤗
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8501
pinned: false
---
The header above configures the Hugging Face Spaces deployment.
Retrieval-Augmented Generation Evaluation Framework
(Legal & financial domains, with a full regulatory-grade metric suite and an interactive dashboard)
Project context — Implementation of the research proposal
"Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains."
Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.
1 Quick start
```bash
git clone https://github.com/Romainkul/rag_evaluation.git
cd rag_evaluation

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pre-commit install

bash scripts/download_data.sh

python scripts/analysis.py \
    --config configs/kilt_hybrid_ce.yaml \
    --queries data/sample_queries.jsonl
```
The first call embeds documents, builds a FAISS dense index and a Pyserini sparse index; subsequent runs reuse them.
2 Repository layout
```
evaluation/              – Core library
├── config.py            • Typed dataclasses (retriever, generator, stats, reranker, logging)
├── pipeline.py          • Retrieval → (optional) re-rank → generation
├── retrievers/          • BM25, Dense (Sentence-Transformers + FAISS), Hybrid
├── rerankers/           • Cross-encoder re-ranker
├── generators/          • Hugging Face seq2seq wrapper
├── metrics/             • Retrieval, generation, composite RAG score
└── stats/               • Correlation, significance, robustness utilities

scripts/                 – CLI tools
├── prep_annotations.py  • Runs the RAG pipeline and logs all outputs for expert annotation
├── analysis.py          • Grid runner – all configs × datasets, RQ1–RQ4 analysis
└── dashboard.py         • Streamlit dashboard for interactive exploration

tests/                   – PyTest suite
configs/                 – YAML templates for pipelines & stats
.github/workflows/       – Lint + tests CI
Dockerfile               – Slim reproducible image
```
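Conceptually, `pipeline.py` chains these pieces as retrieval → (optional) re-rank → generation. The sketch below only illustrates that flow; `SimpleRAGPipeline` and the `Retriever`/`Reranker`/`Generator` protocols are placeholder names, not the library's actual classes.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...


class Reranker(Protocol):
    def rerank(self, query: str, docs: List[str]) -> List[str]: ...


class Generator(Protocol):
    def generate(self, query: str, docs: List[str]) -> str: ...


@dataclass
class SimpleRAGPipeline:
    """Illustrative retrieval -> (optional) re-rank -> generation flow."""
    retriever: Retriever
    generator: Generator
    reranker: Optional[Reranker] = None
    top_k: int = 5

    def run(self, query: str) -> dict:
        docs = self.retriever.retrieve(query, k=self.top_k)
        if self.reranker is not None:          # re-ranking step is optional
            docs = self.reranker.rerank(query, docs)
        answer = self.generator.generate(query, docs)
        return {"query": query, "contexts": docs, "answer": answer}
```

In the real library, the typed dataclasses in `evaluation/config.py` decide which concrete retriever, re-ranker, and generator fill these slots.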
3 Mapping code → proposal tasks
| Research-proposal element | Code artefact | Purpose |
|---|---|---|
| RQ1 Classical retrieval → factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman / Kendall τ with CIs for MRR, MAP, P@k vs `human_correct`. |
| RQ2 Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F, etc. with `human_faithful`; Wilcoxon + Holm. |
| RQ3 Error propagation → hallucination | `evaluation.stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles. |
| RQ4 Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Δ-metrics & Cohen's d between clean and perturbed runs. |
| Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
| EU AI-Act traceability (Art. 14–15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
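The RQ3 and RQ4 rows boil down to textbook statistics. As a rough, self-contained NumPy/SciPy sketch of those computations (the function names here are illustrative, not the library's API):

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_failure_association(retrieval_failed, hallucinated):
    """RQ3-style check: is hallucination associated with retrieval failure?

    Both inputs are equal-length boolean sequences, one entry per query.
    """
    table = np.zeros((2, 2), dtype=int)
    for failed, halluc in zip(retrieval_failed, hallucinated):
        table[int(failed), int(halluc)] += 1        # 2x2 contingency table
    chi2, p, dof, _expected = chi2_contingency(table)
    return chi2, p, dof


def delta_and_cohens_d(clean_scores, perturbed_scores):
    """RQ4-style effect size: metric drop under perturbation plus Cohen's d."""
    clean = np.asarray(clean_scores, dtype=float)
    pert = np.asarray(perturbed_scores, dtype=float)
    delta = clean.mean() - pert.mean()
    pooled_sd = np.sqrt((clean.var(ddof=1) + pert.var(ddof=1)) / 2)
    return delta, (delta / pooled_sd if pooled_sd > 0 else 0.0)
```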
4 Running a grid of experiments
```bash
# Evaluate three configs on two datasets, save everything under outputs/grid
python scripts/analysis.py \
    --configs configs/*.yaml \
    --datasets data/legal.jsonl data/finance.jsonl \
    --plots
```
Per dataset the script writes:
```
outputs/grid/<dataset>/<config>/
    results.jsonl         – per-query outputs + metrics
    aggregates.yaml       – mean metrics
    rq1.yaml … rq4.yaml   – answers to each research question
    mrr_vs_correct.png    – diagnostic scatter
outputs/grid/<dataset>/wilcoxon_rag_holm.yaml – pairwise p-values
```
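To post-process a finished grid outside the dashboard, the per-config `aggregates.yaml` files can be stacked into one comparison table. A small sketch, assuming each `aggregates.yaml` is a flat mapping of metric name to mean value and that PyYAML and pandas are available:

```python
from pathlib import Path
import pandas as pd
import yaml

rows = []
for agg_path in Path("outputs/grid/legal").glob("*/aggregates.yaml"):
    with open(agg_path) as f:
        metrics = yaml.safe_load(f)                 # mean metrics for one config
    rows.append({"config": agg_path.parent.name, **metrics})

summary = pd.DataFrame(rows).set_index("config")    # configs x metrics
print(summary.round(3).to_string())
```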
Incremental mode
Run a single new config and automatically compare it to all previous ones:
```bash
python scripts/analysis.py \
    --configs configs/my_new.yaml \
    --datasets data/legal.jsonl \
    --outdir outputs/grid \
    --plots
```
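The pairwise comparison stored in `wilcoxon_rag_holm.yaml` amounts to a signed-rank test per config pair followed by Holm correction. A NumPy/SciPy-only sketch of that logic (not the script's actual code):

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon


def holm_correction(pvals):
    """Holm step-down adjustment of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def pairwise_wilcoxon_holm(scores_by_config):
    """scores_by_config: dict of config name -> per-query RAG scores (same query order)."""
    pairs = list(combinations(scores_by_config, 2))
    raw = [wilcoxon(scores_by_config[a], scores_by_config[b]).pvalue for a, b in pairs]
    adjusted = holm_correction(raw)
    return {f"{a} vs {b}": float(p) for (a, b), p in zip(pairs, adjusted)}
```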
5 Interactive dashboard
```bash
streamlit run scripts/dashboard.py
```
The UI lets you:
- pick a dataset
- select any subset of configs
- view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
- download raw `results.jsonl` for external analysis
6 Index generation details
- Sparse (BM25 / Lucene) – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini's CLI to build it from the `doc_store` JSONL (`{"id", "text"}`).
- Dense (FAISS) – `DenseRetriever` embeds docs with the Sentence-Transformers model named in the config, L2-normalises, and writes an IP-metric FAISS index.
Both artefacts are cached, so the heavy work only happens once.
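For reference, the dense path corresponds roughly to the standalone sketch below, using sentence-transformers and FAISS directly (the model name and file paths are illustrative, not the repo's defaults):

```python
import json
import faiss
from sentence_transformers import SentenceTransformer

# Load the {"id", "text"} JSONL doc store described above (example path).
with open("data/doc_store.jsonl") as f:
    docs = [json.loads(line) for line in f]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
embeddings = model.encode(
    [d["text"] for d in docs],
    normalize_embeddings=True,   # L2-normalise so inner product == cosine similarity
    convert_to_numpy=True,
)

index = faiss.IndexFlatIP(embeddings.shape[1])   # IP metric on unit-norm vectors
index.add(embeddings)
faiss.write_index(index, "dense.faiss")

# Query time mirrors the same normalisation.
query_emb = model.encode(["example query"], normalize_embeddings=True)
scores, ids = index.search(query_emb, 5)
print([docs[i]["id"] for i in ids[0]])
```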
7 Example: manual statistical scripting
```python
from evaluation.stats import corr_ci
from evaluation import StatsConfig
import json

# Per-query records written by scripts/analysis.py for one dataset/config pair.
rows = [json.loads(line) for line in open("outputs/grid/legal/hybrid/results.jsonl")]

cfg = StatsConfig(n_boot=5000)
mrr = [r["metrics"]["mrr"] for r in rows]
gold = [1 if r["human_correct"] else 0 for r in rows]

r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
print(f"Spearman ρ={r:.2f} 95%CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
```
All statistical helpers rely only on NumPy & SciPy, so they run in the minimal Docker image.
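For illustration, a percentile-bootstrap Spearman CI of the kind `corr_ci` returns can be written in a few lines (a generic sketch, not the actual implementation):

```python
import numpy as np
from scipy.stats import spearmanr


def bootstrap_spearman_ci(x, y, n_boot=5000, alpha=0.05, seed=0):
    """Spearman correlation with a percentile-bootstrap confidence interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho, p = spearmanr(x, y)

    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample queries with replacement
        boot[b], _ = spearmanr(x[idx], y[idx])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return rho, (lo, hi), p
```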
Happy evaluating & dashboarding!
Questions or suggestions? Open an issue or start a discussion.