---
title: Regulated Domain RAG Evaluation
emoji: πŸ€–
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8501
pinned: false
---

*(The YAML front matter above is the Hugging Face Spaces configuration.)*

# Retrieval-Augmented Generation Evaluation Framework

*(Legal & Financial domains, with full regulatory-grade metrics and dashboard)*

> **Project context** – Implementation of the research proposal
> **β€œToward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains.”**
> Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.

---

## 1 Quick start

```bash
git clone https://github.com/Romainkul/rag_evaluation.git
cd rag_evaluation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pre-commit install

bash scripts/download_data.sh
python scripts/analysis.py \
  --config configs/kilt_hybrid_ce.yaml \
  --queries data/sample_queries.jsonl
```

The first call embeds documents, builds a **FAISS** dense index and a **Pyserini** sparse index; subsequent runs reuse them.

---

## 2 Repository layout

```
evaluation/              ← Core library
β”œβ”€ config.py             β€’ Typed dataclasses (retriever, generator, stats, reranker, logging)
β”œβ”€ pipeline.py           β€’ Retrieval β†’ (optional) re-rank β†’ generation
β”œβ”€ retrievers/           β€’ BM25, Dense (Sentence-Transformers + FAISS), Hybrid
β”œβ”€ rerankers/            β€’ Cross-encoder re-ranker
β”œβ”€ generators/           β€’ Hugging Face seq2seq wrapper
β”œβ”€ metrics/              β€’ Retrieval, generation, composite RAG score
└─ stats/                β€’ Correlation, significance, robustness utilities

scripts/                 ← CLI tools
β”œβ”€ prep_annotations.py   β€’ Runs the RAG pipeline and logs all outputs for expert annotation
β”œβ”€ analysis.py           β€’ **Grid runner** – all configs Γ— datasets, RQ1-RQ4 analysis
└─ dashboard.py          β€’ **Streamlit dashboard** for interactive exploration

tests/                   ← PyTest tests
configs/                 ← YAML templates for pipelines & stats
.github/workflows/       ← Lint + tests CI
Dockerfile               ← Slim reproducible image
```

---

## 3 Mapping code ↔ proposal tasks

| Research-proposal element | Code artefact | Purpose |
| --- | --- | --- |
| **RQ1** Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman / Kendall ρ with CIs for MRR, MAP, P\@k vs *human\_correct*. |
| **RQ2** Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F etc. with *human\_faithful*; Wilcoxon + Holm. |
| **RQ3** Error propagation β†’ hallucination | `evaluation.stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles (sketched below). |
| **RQ4** Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Ξ”-metrics & Cohen’s *d* between clean and perturbed runs (sketched below). |
| Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
| EU AI-Act traceability (Art. 14-15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
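
For illustration, the RQ3 check boils down to a 2 Γ— 2 contingency test: does a retrieval failure make a hallucinated answer more likely? The sketch below is **not** the `evaluation/stats` implementation; it calls SciPy directly on toy per-query flags (the arrays and flag names are made up) just to show the shape of the computation.

```python
# Illustrative only: the per-query flags below are toy data, not the
# framework's actual results.jsonl schema.
import numpy as np
from scipy.stats import chi2_contingency

retrieval_failed = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])  # 1 = gold doc not retrieved
hallucinated     = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 0])  # 1 = answer judged unfaithful

# 2x2 contingency table: rows = retrieval ok / failed, cols = faithful / hallucinated
table = np.array([
    [np.sum((retrieval_failed == 0) & (hallucinated == 0)),
     np.sum((retrieval_failed == 0) & (hallucinated == 1))],
    [np.sum((retrieval_failed == 1) & (hallucinated == 0)),
     np.sum((retrieval_failed == 1) & (hallucinated == 1))],
])

chi2, p, dof, _ = chi2_contingency(table)
cond = table[:, 1] / table.sum(axis=1)  # conditional hallucination rates
print(f"χ²={chi2:.2f} (dof={dof}), p={p:.3g}, "
      f"P(halluc. | retrieval ok)={cond[0]:.2f}, "
      f"P(halluc. | retrieval failed)={cond[1]:.2f}")
```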
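
Likewise, the RQ4 comparison reduces to a Ξ”-metric plus an effect size between a clean run and its perturbed counterpart. Here is a minimal NumPy-only sketch, assuming two `results.jsonl` files written by the grid runner (the per-query `metrics`/`mrr` keys follow the Β§ 7 example; the `legal_pert` output directory and the `load_metric` / `cohens_d` helpers are illustrative, not part of the library):

```python
# Paths and the "legal_pert" directory name are assumptions; point them at
# wherever your clean and perturbed grid runs were written.
import json
import numpy as np

def load_metric(path: str, metric: str) -> np.ndarray:
    """Collect one per-query metric column from a results.jsonl file."""
    with open(path) as fh:
        return np.array([json.loads(line)["metrics"][metric] for line in fh])

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return float((a.mean() - b.mean()) / pooled)

clean = load_metric("outputs/grid/legal/hybrid/results.jsonl", "mrr")
pert  = load_metric("outputs/grid/legal_pert/hybrid/results.jsonl", "mrr")

print(f"Ξ”-MRR = {pert.mean() - clean.mean():+.3f}, Cohen's d = {cohens_d(pert, clean):+.2f}")
```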
---

## 4 Running a grid of experiments

```bash
# Evaluate three configs on two datasets, save everything under outputs/grid
python scripts/analysis.py \
  --configs configs/*.yaml \
  --datasets data/legal.jsonl data/finance.jsonl \
  --plots
```

*Per dataset* the script writes:

```
outputs/grid/<dataset>/<config>/
  results.jsonl          ← per-query outputs + metrics
  aggregates.yaml        ← mean metrics
  rq1.yaml … rq4.yaml    ← answers to each research question
  mrr_vs_correct.png     ← diagnostic scatter
outputs/grid/<dataset>/wilcoxon_rag_holm.yaml  ← pairwise p-values
```

### Incremental mode

Run a *single* new config and automatically compare it to all previous ones:

```bash
python scripts/analysis.py \
  --configs configs/my_new.yaml \
  --datasets data/legal.jsonl \
  --outdir outputs/grid \
  --plots
```

---

## 5 Interactive dashboard

```bash
streamlit run scripts/dashboard.py
```

The UI lets you

1. pick a dataset
2. select any subset of configs
3. view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
4. download raw `results.jsonl` for external analysis

---

## 6 Index generation details

* **Sparse (BM25 / Lucene)** – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini’s CLI to build it from the `doc_store` JSONL (`{"id", "text"}`).
* **Dense (FAISS)** – `DenseRetriever` embeds documents with the Sentence-Transformers model named in the config, L2-normalises the vectors, and writes an inner-product (IP) FAISS index.

Both artefacts are cached, so the heavy work only happens once.

---

## 7 Example: manual statistical scripting

```python
import json

from evaluation import StatsConfig
from evaluation.stats import corr_ci

rows = [json.loads(line) for line in open("outputs/grid/legal/hybrid/results.jsonl")]
cfg = StatsConfig(n_boot=5000)

mrr = [r["metrics"]["mrr"] for r in rows]
gold = [1 if r["human_correct"] else 0 for r in rows]

r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
print(f"Spearman ρ={r:.2f} 95%CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
```

All statistical helpers rely only on **NumPy & SciPy**, so they run in the minimal Docker image.

---

### Happy evaluating & dashboarding!

Questions or suggestions? Open an issue or start a discussion.