---
title: Regulated Domain RAG Evaluation
emoji: 🤖
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8501
pinned: false
---

This repository doubles as a Hugging Face Spaces setup (Docker SDK, app served on port 8501, see the metadata block above).

# Retrieval-Augmented Generation Evaluation Framework

*(Legal & financial domains, with full regulatory-grade metrics and an interactive dashboard)*

> **Project context** – Implementation of the research proposal
> “Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains.”
> Each folder corresponds to a work package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.


## 1 Quick start

```bash
git clone https://github.com/Romainkul/rag_evaluation.git
cd rag_evaluation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pre-commit install

bash scripts/download_data.sh

python scripts/analysis.py \
  --config configs/kilt_hybrid_ce.yaml \
  --queries data/sample_queries.jsonl
```

The first call embeds documents, builds a FAISS dense index and a Pyserini sparse index; subsequent runs reuse them.


## 2 Repository layout

```
evaluation/                  ← Core library
├─ config.py                 • Typed dataclasses (retriever, generator, stats, reranker, logging)
├─ pipeline.py               • Retrieval → (optional) re-rank → generation
├─ retrievers/               • BM25, Dense (Sentence-Transformers + FAISS), Hybrid
├─ rerankers/                • Cross-encoder re-ranker
├─ generators/               • Hugging Face seq2seq wrapper
├─ metrics/                  • Retrieval, generation, composite RAG score
└─ stats/                    • Correlation, significance, robustness utilities
scripts/                     ← CLI tools
├─ prep_annotations.py       • Runs the RAG pipeline and logs all outputs for expert annotation
├─ analysis.py               • Grid runner – all configs × datasets, RQ1–RQ4 analysis
└─ dashboard.py              • Streamlit dashboard for interactive exploration
tests/                       ← PyTest tests
configs/                     ← YAML templates for pipelines & stats
.github/workflows/           ← Lint + tests CI
Dockerfile                   ← Slim, reproducible image
```
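
To give a feel for how a YAML template in `configs/` maps onto the typed dataclasses in `evaluation/config.py`, here is a deliberately simplified, hypothetical sketch: apart from `StatsConfig`'s fields (which appear in section 7), every name and default below is illustrative rather than the framework's actual API.

```python
# Hypothetical sketch of the typed-config idea – field names (other than
# StatsConfig's, which appear in section 7) are illustrative, not the actual
# evaluation/config.py API.
from dataclasses import dataclass, field

import yaml


@dataclass
class RetrieverConfig:
    kind: str = "hybrid"          # e.g. "bm25", "dense", or "hybrid"
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    top_k: int = 10


@dataclass
class StatsConfig:
    correlation_method: str = "spearman"
    n_boot: int = 1000


@dataclass
class ExperimentConfig:
    retriever: RetrieverConfig = field(default_factory=RetrieverConfig)
    stats: StatsConfig = field(default_factory=StatsConfig)

    @classmethod
    def from_yaml(cls, path: str) -> "ExperimentConfig":
        # Read a YAML template and build typed sub-configs from its sections.
        with open(path) as fh:
            raw = yaml.safe_load(fh) or {}
        return cls(
            retriever=RetrieverConfig(**raw.get("retriever", {})),
            stats=StatsConfig(**raw.get("stats", {})),
        )
```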

## 3 Mapping code ↔ proposal tasks

| Research-proposal element | Code artefact | Purpose |
|---|---|---|
| RQ1 – Classical retrieval ↔ factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman ρ / Kendall τ with CIs for MRR, MAP, P@k vs `human_correct`. |
| RQ2 – Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F, etc. with `human_faithful`; Wilcoxon + Holm correction. |
| RQ3 – Error propagation → hallucination | `evaluation.stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles. |
| RQ4 – Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Δ-metrics & Cohen's d between clean and perturbed runs. |
| Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
| EU AI Act traceability (Art. 14–15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
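
As a concrete picture of the Wilcoxon + Holm step used for RQ2 (and of the pairwise p-values written to `wilcoxon_rag_holm.yaml`), the sketch below runs pairwise Wilcoxon signed-rank tests with a Holm step-down correction using only NumPy and SciPy. It is an illustration on synthetic scores, not the framework's own implementation.

```python
# Illustrative sketch of pairwise Wilcoxon signed-rank tests with a Holm
# step-down correction (RQ2-style comparison). Not the framework's own code;
# the metric values below are synthetic.
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon


def pairwise_wilcoxon_holm(per_query_scores):
    """Map {config_name: per-query metric values} (same query order everywhere)
    to {(config_a, config_b): Holm-adjusted p-value}."""
    pairs = list(combinations(sorted(per_query_scores), 2))
    raw_p = [wilcoxon(per_query_scores[a], per_query_scores[b]).pvalue for a, b in pairs]

    # Holm: sort p-values ascending, scale the rank-th smallest by (m - rank),
    # then enforce monotonicity and cap at 1.
    m = len(raw_p)
    order = np.argsort(raw_p)
    adjusted, running_max = np.empty(m), 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * raw_p[idx])
        adjusted[idx] = min(1.0, running_max)
    return dict(zip(pairs, adjusted))


# Synthetic per-query RAG scores for three configurations on 30 shared queries.
rng = np.random.default_rng(0)
base = rng.uniform(0.2, 0.8, size=30)
scores = {
    "bm25":   base + rng.normal(-0.05, 0.05, 30),
    "dense":  base + rng.normal(0.00, 0.05, 30),
    "hybrid": base + rng.normal(0.05, 0.05, 30),
}
print(pairwise_wilcoxon_holm(scores))
```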

## 4 Running a grid of experiments

```bash
# Evaluate three configs on two datasets, save everything under outputs/grid
python scripts/analysis.py \
  --configs configs/*.yaml \
  --datasets data/legal.jsonl data/finance.jsonl \
  --plots
```

Per dataset the script writes:

```
outputs/grid/<dataset>/<config>/
    results.jsonl          ← per-query outputs + metrics
    aggregates.yaml        ← mean metrics
    rq1.yaml … rq4.yaml    ← answers to each research question
    mrr_vs_correct.png     ← diagnostic scatter
outputs/grid/<dataset>/wilcoxon_rag_holm.yaml  ← pairwise p-values
```

### Incremental mode

Run a single new config and automatically compare it to all previous ones:

```bash
python scripts/analysis.py \
  --configs configs/my_new.yaml \
  --datasets data/legal.jsonl \
  --outdir outputs/grid \
  --plots
```
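
Outside the dashboard, the per-config `aggregates.yaml` files are easy to collect into a single comparison table. A minimal sketch, assuming each `aggregates.yaml` is a flat mapping from metric name to mean value (adjust if the actual layout differs):

```python
# Sketch: collect the mean metrics from every config evaluated on one dataset.
# Assumes each aggregates.yaml is a flat {metric_name: value} mapping.
from pathlib import Path

import pandas as pd
import yaml

dataset_dir = Path("outputs/grid/legal")

aggregates = {}
for agg_path in sorted(dataset_dir.glob("*/aggregates.yaml")):
    config_name = agg_path.parent.name
    aggregates[config_name] = yaml.safe_load(agg_path.read_text())

summary = pd.DataFrame(aggregates).T      # one row per config, one column per metric
print(summary.round(3))
```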

## 5 Interactive dashboard

```bash
streamlit run scripts/dashboard.py
```

The UI lets you:

1. pick a dataset,
2. select any subset of configs,
3. view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and the RQ1–RQ4 YAMLs, and
4. download the raw `results.jsonl` for external analysis.
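
The real implementation lives in `scripts/dashboard.py`; the snippet below only sketches the selection pattern described above (dataset picker, config multiselect, results preview and download) with standard Streamlit widgets and the output layout from section 4.

```python
# Sketch of the dashboard's selection pattern – not the real scripts/dashboard.py.
from pathlib import Path

import pandas as pd
import streamlit as st

GRID = Path("outputs/grid")

dataset = st.selectbox("Dataset", sorted(p.name for p in GRID.iterdir() if p.is_dir()))
configs = st.multiselect(
    "Configs",
    sorted(p.name for p in (GRID / dataset).iterdir() if p.is_dir()),
)

for config in configs:
    results_path = GRID / dataset / config / "results.jsonl"
    df = pd.read_json(results_path, lines=True)   # per-query outputs + metrics
    st.subheader(config)
    st.dataframe(df.head(20))                     # quick preview of the first queries
    st.download_button(
        f"Download {config} results.jsonl",
        results_path.read_bytes(),
        file_name=f"{dataset}_{config}_results.jsonl",
    )
```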

## 6 Index generation details

- **Sparse (BM25 / Lucene)** – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini's CLI to build it from the `doc_store` JSONL (`{"id", "text"}` records).
- **Dense (FAISS)** – `DenseRetriever` embeds the documents with the Sentence-Transformers model named in the config, L2-normalises the embeddings, and writes an inner-product (IP) FAISS index.

Both artefacts are cached, so the heavy work only happens once.
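
For the dense side, the process boils down to: embed, L2-normalise, store in an inner-product FAISS index. A minimal sketch with placeholder paths and model name (not the framework's defaults):

```python
# Sketch: build and cache a dense FAISS index from a {"id", "text"} JSONL doc store.
# Model name and file paths are illustrative placeholders.
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

doc_store = "data/doc_store.jsonl"
index_path = "indexes/dense.faiss"

with open(doc_store) as fh:
    docs = [json.loads(line) for line in fh]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(
    [d["text"] for d in docs],
    batch_size=64,
    normalize_embeddings=True,          # L2-normalise so inner product == cosine
    convert_to_numpy=True,
).astype(np.float32)

index = faiss.IndexFlatIP(embeddings.shape[1])   # inner-product (IP) metric
index.add(embeddings)
faiss.write_index(index, index_path)             # cached for subsequent runs
```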


## 7 Example: manual statistical scripting

```python
import json

from evaluation import StatsConfig
from evaluation.stats import corr_ci

# Load the per-query results written by scripts/analysis.py
with open("outputs/grid/legal/hybrid/results.jsonl") as fh:
    rows = [json.loads(line) for line in fh]

cfg = StatsConfig(n_boot=5000)

# Correlate per-query MRR with the binary expert-correctness label
mrr  = [r["metrics"]["mrr"] for r in rows]
gold = [1 if r["human_correct"] else 0 for r in rows]

r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
print(f"Spearman ρ={r:.2f}  95% CI=({lo:.2f}, {hi:.2f})  p={p:.3g}")
```

All statistical helpers rely only on NumPy & SciPy, so they run in the minimal Docker image.
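
For reference, a bootstrap confidence interval of this kind needs nothing beyond NumPy and SciPy. The sketch below illustrates the idea behind `corr_ci`; it is not the actual implementation in `evaluation.stats`:

```python
# Illustration of a bootstrap CI for a Spearman correlation using only
# NumPy and SciPy. This sketches the idea behind corr_ci; it is not the
# actual implementation in evaluation.stats.
import numpy as np
from scipy.stats import spearmanr


def spearman_boot_ci(x, y, n_boot=5000, alpha=0.05, seed=0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rho, p = spearmanr(x, y)                      # point estimate + p-value

    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample queries with replacement
        boot[i], _ = spearmanr(x[idx], y[idx])
    lo, hi = np.nanpercentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return rho, (lo, hi), p


# Example with synthetic scores and binary expert labels.
rng = np.random.default_rng(1)
mrr = rng.uniform(0, 1, 200)
gold = (mrr + rng.normal(0, 0.3, 200) > 0.5).astype(int)
print(spearman_boot_ci(mrr, gold, n_boot=2000))
```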


Happy evaluating & dashboarding!

Questions or suggestions? Open an issue or start a discussion.