---
title: Regulated Domain RAG Evaluation
emoji: 🤗
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 8501
pinned: false
---
The header above configures the Hugging Face Spaces deployment.
Retrieval-Augmented Generation Evaluation Framework
(Legal & financial domains, with a full regulatory-grade metric suite and an interactive dashboard)
Project context — Implementation of the research proposal
"Toward Comprehensive Evaluation of Retrieval-Augmented Generation Systems in Regulated Domains."
Each folder corresponds to a work-package: retrieval pipelines, metric library, robustness & statistical analysis, automation (CI + Docker), and an interactive dashboard.
1 Quick start
```bash
git clone https://github.com/Romainkul/rag_evaluation.git
cd rag_evaluation

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pre-commit install

bash scripts/download_data.sh

python scripts/analysis.py \
    --config configs/kilt_hybrid_ce.yaml \
    --queries data/sample_queries.jsonl
```
The first call embeds documents, builds a FAISS dense index and a Pyserini sparse index; subsequent runs reuse them.
2 Repository layout
```
evaluation/              – Core library
├── config.py            • Typed dataclasses (retriever, generator, stats, reranker, logging)
├── pipeline.py          • Retrieval → (optional) re-rank → generation
├── retrievers/          • BM25, Dense (Sentence-Transformers + FAISS), Hybrid
├── rerankers/           • Cross-encoder re-ranker
├── generators/          • Hugging Face seq2seq wrapper
├── metrics/             • Retrieval, generation, composite RAG score
└── stats/               • Correlation, significance, robustness utilities

scripts/                 – CLI tools
├── prep_annotations.py  • Runs the RAG pipeline and logs all outputs for expert annotation
├── analysis.py          • Grid runner – all configs × datasets, RQ1–RQ4 analysis
└── dashboard.py         • Streamlit dashboard for interactive exploration

tests/                   – PyTest suite
configs/                 – YAML templates for pipelines & stats
.github/workflows/       – Lint + tests CI
Dockerfile               – Slim reproducible image
```
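Conceptually, `pipeline.py` chains these pieces as retrieval → (optional) re-rank → generation. The sketch below only illustrates that flow; `SimpleRAGPipeline` and the `Retriever`/`Reranker`/`Generator` protocols are placeholder names, not the library's actual classes.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...


class Reranker(Protocol):
    def rerank(self, query: str, docs: List[str]) -> List[str]: ...


class Generator(Protocol):
    def generate(self, query: str, docs: List[str]) -> str: ...


@dataclass
class SimpleRAGPipeline:
    """Illustrative retrieval -> (optional) re-rank -> generation flow."""
    retriever: Retriever
    generator: Generator
    reranker: Optional[Reranker] = None
    top_k: int = 5

    def run(self, query: str) -> dict:
        docs = self.retriever.retrieve(query, k=self.top_k)
        if self.reranker is not None:          # re-ranking step is optional
            docs = self.reranker.rerank(query, docs)
        answer = self.generator.generate(query, docs)
        return {"query": query, "contexts": docs, "answer": answer}
```

In the real library, the typed dataclasses in `evaluation/config.py` decide which concrete retriever, re-ranker, and generator fill these slots.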
3 Mapping code → proposal tasks
| Research-proposal element | Code artefact | Purpose |
|---|---|---|
| RQ1 Classical retrieval → factual correctness | `evaluation/retrievers/`, `analysis.py` | Computes Spearman / Kendall τ with CIs for MRR, MAP, P@k vs `human_correct`. |
| RQ2 Faithfulness metrics vs expert judgements | `evaluation/metrics/`, `evaluation/stats/`, `analysis.py` | Correlates QAGS, FactScore, RAGAS-F, etc. with `human_faithful`; Wilcoxon + Holm. |
| RQ3 Error propagation → hallucination | `evaluation.stats.robustness`, `analysis.py` | χ² test, conditional failure rates across corpora / document styles. |
| RQ4 Robustness to adversarial evidence | Perturbed datasets (`*_pert.jsonl`) + `analysis.py` | Δ-metrics & Cohen's d between clean and perturbed runs. |
| Interactive analysis / decision-making | `scripts/dashboard.py` | Select dataset + configs, explore tables & plots instantly. |
| EU AI-Act traceability (Art. 14–15) | Rotating file logging (`evaluation/utils/logger.py`), Docker, CI | Full run provenance (config + log + results + stats) stored under `outputs/`. |
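The RQ3 and RQ4 rows boil down to textbook statistics. As a rough, self-contained NumPy/SciPy sketch of those computations (the function names here are illustrative, not the library's API):

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_failure_association(retrieval_failed, hallucinated):
    """RQ3-style check: is hallucination associated with retrieval failure?

    Both inputs are equal-length boolean sequences, one entry per query.
    """
    table = np.zeros((2, 2), dtype=int)
    for failed, halluc in zip(retrieval_failed, hallucinated):
        table[int(failed), int(halluc)] += 1        # 2x2 contingency table
    chi2, p, dof, _expected = chi2_contingency(table)
    return chi2, p, dof


def delta_and_cohens_d(clean_scores, perturbed_scores):
    """RQ4-style effect size: metric drop under perturbation plus Cohen's d."""
    clean = np.asarray(clean_scores, dtype=float)
    pert = np.asarray(perturbed_scores, dtype=float)
    delta = clean.mean() - pert.mean()
    pooled_sd = np.sqrt((clean.var(ddof=1) + pert.var(ddof=1)) / 2)
    return delta, (delta / pooled_sd if pooled_sd > 0 else 0.0)
```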
4 Running a grid of experiments
```bash
# Evaluate three configs on two datasets, save everything under outputs/grid
python scripts/analysis.py \
    --configs configs/*.yaml \
    --datasets data/legal.jsonl data/finance.jsonl \
    --plots
```
Per dataset the script writes:
```
outputs/grid/<dataset>/<config>/
    results.jsonl         – per-query outputs + metrics
    aggregates.yaml       – mean metrics
    rq1.yaml … rq4.yaml   – answers to each research question
    mrr_vs_correct.png    – diagnostic scatter
outputs/grid/<dataset>/wilcoxon_rag_holm.yaml – pairwise p-values
```
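To post-process a finished grid outside the dashboard, the per-config `aggregates.yaml` files can be stacked into one comparison table. A small sketch, assuming each `aggregates.yaml` is a flat mapping of metric name to mean value and that PyYAML and pandas are available:

```python
from pathlib import Path
import pandas as pd
import yaml

rows = []
for agg_path in Path("outputs/grid/legal").glob("*/aggregates.yaml"):
    with open(agg_path) as f:
        metrics = yaml.safe_load(f)                 # mean metrics for one config
    rows.append({"config": agg_path.parent.name, **metrics})

summary = pd.DataFrame(rows).set_index("config")    # configs x metrics
print(summary.round(3).to_string())
```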
Incremental mode
Run a single new config and automatically compare it to all previous ones:
```bash
python scripts/analysis.py \
    --configs configs/my_new.yaml \
    --datasets data/legal.jsonl \
    --outdir outputs/grid \
    --plots
```
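The pairwise comparison stored in `wilcoxon_rag_holm.yaml` amounts to a signed-rank test per config pair followed by Holm correction. A NumPy/SciPy-only sketch of that logic (not the script's actual code):

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon


def holm_correction(pvals):
    """Holm step-down adjustment of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def pairwise_wilcoxon_holm(scores_by_config):
    """scores_by_config: dict of config name -> per-query RAG scores (same query order)."""
    pairs = list(combinations(scores_by_config, 2))
    raw = [wilcoxon(scores_by_config[a], scores_by_config[b]).pvalue for a, b in pairs]
    adjusted = holm_correction(raw)
    return {f"{a} vs {b}": float(p) for (a, b), p in zip(pairs, adjusted)}
```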
5 Interactive dashboard
```bash
streamlit run scripts/dashboard.py
```
The UI lets you:
- pick a dataset
- select any subset of configs
- view aggregated tables, bar/box/scatter plots, Wilcoxon tables, and RQ1–RQ4 YAMLs
- download raw `results.jsonl` for external analysis
6 Index generation details
- Sparse (BM25 / Lucene) – If `bm25_index` is missing, `BM25Retriever` invokes Pyserini's CLI to build it from the `doc_store` JSONL (`{"id", "text"}`).
- Dense (FAISS) – `DenseRetriever` embeds docs with the Sentence-Transformers model named in the config, L2-normalises, and writes an IP-metric FAISS index.
Both artefacts are cached, so the heavy work only happens once.
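For reference, the dense path corresponds roughly to the standalone sketch below, using sentence-transformers and FAISS directly (the model name and file paths are illustrative, not the repo's defaults):

```python
import json
import faiss
from sentence_transformers import SentenceTransformer

# Load the {"id", "text"} JSONL doc store described above (example path).
with open("data/doc_store.jsonl") as f:
    docs = [json.loads(line) for line in f]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example model
embeddings = model.encode(
    [d["text"] for d in docs],
    normalize_embeddings=True,   # L2-normalise so inner product == cosine similarity
    convert_to_numpy=True,
)

index = faiss.IndexFlatIP(embeddings.shape[1])   # IP metric on unit-norm vectors
index.add(embeddings)
faiss.write_index(index, "dense.faiss")

# Query time mirrors the same normalisation.
query_emb = model.encode(["example query"], normalize_embeddings=True)
scores, ids = index.search(query_emb, 5)
print([docs[i]["id"] for i in ids[0]])
```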
7 Example: manual statistical scripting
```python
from evaluation.stats import corr_ci
from evaluation import StatsConfig
import json

# Per-query records written by scripts/analysis.py for one dataset/config pair.
rows = [json.loads(line) for line in open("outputs/grid/legal/hybrid/results.jsonl")]

cfg = StatsConfig(n_boot=5000)
mrr = [r["metrics"]["mrr"] for r in rows]
gold = [1 if r["human_correct"] else 0 for r in rows]

r, (lo, hi), p = corr_ci(mrr, gold, method=cfg.correlation_method, n_boot=cfg.n_boot)
print(f"Spearman ρ={r:.2f} 95%CI=({lo:.2f},{hi:.2f}) p={p:.3g}")
```
All statistical helpers rely only on NumPy & SciPy, so they run in the minimal Docker image.
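For illustration, a percentile-bootstrap Spearman CI of the kind `corr_ci` returns can be written in a few lines (a generic sketch, not the actual implementation):

```python
import numpy as np
from scipy.stats import spearmanr


def bootstrap_spearman_ci(x, y, n_boot=5000, alpha=0.05, seed=0):
    """Spearman correlation with a percentile-bootstrap confidence interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho, p = spearmanr(x, y)

    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample queries with replacement
        boot[b], _ = spearmanr(x[idx], y[idx])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return rho, (lo, hi), p
```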
Happy evaluating & dashboarding!
Questions or suggestions? Open an issue or start a discussion.