query-context-pruner-multilingual
A multilingual model for removing contexts that are irrelevant to a query. Because it is a relatively large LLM, it is too slow for speed-critical information retrieval pipelines, where lightweight pruning methods such as Provence are needed instead; it is, however, particularly well suited to generating teacher labels for training such efficient pruning models.
We offer two model sizes: a 4B model for accuracy-focused label generation where precision matters more than speed, and a 1.7B model for high-speed label generation.
Why This Model is Valuable
As described in the Provence paper, context pruning addresses critical challenges in RAG systems:
- Context Noise Problem: Retrieved documents often contain irrelevant sentences that mislead LLMs and degrade response quality
- Computational Overhead: Long contexts slow down generation and increase costs significantly
- Performance Degradation: Irrelevant information leads to hallucinations and poor answer quality
Our model enables efficient training data generation for lightweight context pruners that can:
- Reduce context length by 70-90% while maintaining accuracy
- Achieve almost zero-cost pruning when unified with reranking (as shown in Provence)
- Work out-of-the-box across diverse domains and languages
- Dynamically detect optimal pruning ratios (0-100%) per context
This addresses the practical need for robust, adaptable context pruners that can be deployed in production RAG systems without compromising performance.
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose model size
MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-4B"  # Recommended (best balance)
# MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-1.7B"  # Faster alternative

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def create_user_prompt(query: str, contexts: list[str]) -> str:
    """Create a formatted prompt from the query and numbered context chunks."""
    context_str = "\n".join([f"[{i+1}] {ctx}" for i, ctx in enumerate(contexts)])
    return f"{query}\n---\n{context_str}"


# Example: Python documentation split into chunks
query = "How do you read and write files in Python?"
contexts = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "To read a file in Python, you use the open() function with the 'r' mode. For example: with open('file.txt', 'r') as f: content = f.read()",
    "Writing to files in Python also uses open() with 'w' mode for writing or 'a' mode for appending. Example: with open('file.txt', 'w') as f: f.write('Hello')",
    "Python supports multiple programming paradigms including object-oriented and functional programming.",
    "The 'with' statement ensures proper file handling by automatically closing files after use, preventing resource leaks.",
    "Python has various built-in functions like len(), range(), and type() that are commonly used in everyday programming.",
]

# Generate response
prompt = create_user_prompt(query, contexts)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(f"Relevant chunks: {response}")  # Expected: "2,3,5" (reading, writing, and proper file handling)
# Note: The model's output indices start from 1, not 0
```
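To actually prune, the 1-based citation numbers in `response` can be mapped back onto the 0-based `contexts` list. The continuation below is only illustrative and not part of the original snippet; a more defensive parser that also handles the "No answer" output is sketched in the Technical Implementation section.

```python
# Map the model's 1-based citation numbers back to the 0-based Python list
selected = [int(i) - 1 for i in response.split(",")]
pruned_context = "\n".join(contexts[i] for i in selected)
print(pruned_context)  # keeps only the chunks about reading, writing, and the 'with' statement
```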
Model Description
Query-Context Pruner is designed to solve the context overload problem in RAG systems by identifying which text chunks contain information relevant to answering a query. This enables:
- Efficient RAG pipelines: Reduce context length by 70-90% while maintaining accuracy
- Training data generation: Create high-quality labels for smaller bi-encoder models (see the labeling sketch after this list)
- Multilingual information retrieval: Works across 20 languages with varying performance
- Context compression: Eliminate irrelevant information before passing to LLMs
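As a concrete illustration of the teacher-labeling workflow referenced above, the loop below reuses `model`, `tokenizer`, and `create_user_prompt` from the Quick Start to label a batch of (query, contexts) pairs and writes the raw citation strings to a JSONL file. This is only a sketch; the `label_batch` helper and the output field names are not part of the released code.

```python
import json

def label_batch(samples: list[dict]) -> list[dict]:
    """Generate teacher labels (relevant chunk indices) for (query, contexts) pairs."""
    labeled = []
    for sample in samples:
        prompt = create_user_prompt(sample["query"], sample["contexts"])
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        # Store the raw 1-based indices as emitted by the model, e.g. "2,3,5" or "No answer"
        labeled.append({**sample, "relevant_chunks": response.strip()})
    return labeled

# Write one JSON object per line for downstream training of a lightweight pruner
with open("teacher_labels.jsonl", "w", encoding="utf-8") as f:
    for row in label_batch([{"query": query, "contexts": contexts}]):
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```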
Supported Languages
The model supports 20 languages with varying performance levels. Important: While some languages show excellent performance, others have limited effectiveness. Please carefully review your target language's performance before use.
Note on Evaluation: Many test datasets (especially MIRACL) have very small sample sizes, making test scores less reliable. We report averaged F1 scores (test + validation) for more robust evaluation.
Language Performance Summary (4B Model)
| Language | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Performance Tier |
|---|---|---|---|---|---|---|
| High Performance Languages (F1 > 0.85) | | | | | | |
| Russian (ru) | 0.901 | 0.910 | 0.891 | 82 | 82 | Excellent |
| Arabic (ar) | 0.899 | 0.916 | 0.882 | 61 | 61 | Excellent |
| Finnish (fi) | 0.877 | 0.874 | 0.880 | 48 | 48 | Excellent |
| Indonesian (id) | 0.871 | 0.858 | 0.883 | 64 | 64 | Excellent |
| Japanese (ja)* | 0.865 | 0.860 | 0.870 | 478 | 478 | Excellent |
| Good Performance Languages (F1 0.75-0.85) | | | | | | |
| English (en)* | 0.847 | 0.874 | 0.819 | 682 | 682 | Very Good |
| Korean (ko) | 0.849 | 0.910 | 0.787 | 30 | 30 | Very Good† |
| German (de) | 0.843 | 0.776 | 0.909 | 14 | 14 | Very Good† |
| Persian (fa) | 0.818 | 0.739 | 0.896 | 27 | 27 | Very Good† |
| Swahili (sw) | 0.806 | 0.767 | 0.845 | 37 | 37 | Very Good |
| Chinese (zh) | 0.805 | 0.777 | 0.833 | 38 | 38 | Good |
| Moderate Performance Languages (F1 0.60-0.75) | | | | | | |
| Spanish (es) | 0.715 | 0.800 | 0.629 | 31 | 31 | Moderate‡ |
| Yoruba (yo) | 0.662 | 0.630 | 0.694 | 18 | 18 | Moderate† |
| French (fr) | 0.622 | 0.774 | 0.470 | 29 | 29 | Moderate‡ |
| Limited Performance Languages (F1 < 0.60) | | | | | | |
| Thai (th) | 0.378 | 0.239 | 0.516 | 56 | 56 | Poor |
| Hindi (hi) | 0.348 | 0.538 | 0.158 | 36 | 36 | Poor‡ |
| Bengali (bn) | 0.217 | 0.222 | 0.212 | 23 | 23 | Very Poor |
| Telugu (te) | 0.109 | 0.067 | 0.150 | 61 | 61 | Very Poor |
* Average across multiple datasets (Japanese: 8 datasets, English: MS-MARCO + MIRACL)
† Very small test set - interpret with caution
‡ High variance between test/validation - less stable performance
Note: Performance varies by domain. Languages with more training data generally perform better.
Model Evaluation
We evaluated four model sizes on 28 datasets covering multiple languages and domains. Results shown are combined test+validation scores for robustness.
Model Size Comparison
| Model | Avg F1 | Avg Exact Match | Relative Speed | Recommendation |
|---|---|---|---|---|
| 4B | 0.820 | 0.712 | 1.0x | Recommended |
| 1.7B | 0.794 | 0.666 | 1.5x | High-speed option |
| 8B | 0.827 | 0.727 | 0.6x | Maximum accuracy (not released) |
| 0.6B | 0.707 | 0.600 | 2.5x | Baseline (not released) |
Performance Highlights
Why 4B Model is Recommended:
- Only a ~0.9% relative F1 gap vs. the 8B model (0.820 vs 0.827)
- 67% faster inference than 8B model
- Perfect balance of accuracy and efficiency
- Optimal for production deployments
Detailed Performance by Dataset (4B Model)
| Dataset | Test F1 | Val F1 | Combined F1 | Test/Val Samples | Notes |
|---|---|---|---|---|---|
| Japanese Datasets | | | | | |
| jaquad | 0.994 | 0.952 | 0.973 | 47/47 | Excellent |
| mr-tydi-ja | 0.939 | 0.791 | 0.865 | 55/55 | Very good |
| jsquad | 0.942 | 0.787 | 0.865 | 53/53 | Very good |
| jqara | 0.906 | 0.888 | 0.897 | 54/54 | Excellent |
| mkqa-ja | 0.679 | 0.823 | 0.751 | 55/55 | Good |
| quiz-works | 0.933 | 0.869 | 0.901 | 51/51 | Excellent |
| quiz-no-mori | 0.607 | 0.879 | 0.743 | 52/52 | Good |
| ms-marco-ja | 0.846 | 0.821 | 0.834 | 168/168 | Very good |
| English Datasets | | | | | |
| ms-marco-en | 0.888 | 0.869 | 0.879 | 630/630 | Very good |
| miracl-en | 0.860 | 0.769 | 0.815 | 52/52 | Very good† |
| MIRACL Languages | | | | | |
| miracl-ar | 0.916 | 0.882 | 0.899 | 61/61 | Excellent† |
| miracl-ru | 0.910 | 0.891 | 0.901 | 82/82 | Excellent† |
| miracl-fi | 0.874 | 0.880 | 0.877 | 48/48 | Very good† |
| miracl-id | 0.858 | 0.883 | 0.871 | 64/64 | Very good† |
| miracl-ko | 0.910 | 0.787 | 0.849 | 30/30 | Very good† |
| miracl-fa | 0.739 | 0.896 | 0.818 | 27/27 | Very good† |
| miracl-de | 0.776 | 0.909 | 0.843 | 14/14 | Very good† |
| miracl-sw | 0.767 | 0.845 | 0.806 | 37/37 | Very good† |
| miracl-zh | 0.777 | 0.833 | 0.805 | 38/38 | Very good† |
| miracl-es | 0.800 | 0.629 | 0.715 | 31/31 | Good† |
| miracl-yo | 0.630 | 0.694 | 0.662 | 18/18 | Moderate† |
| miracl-fr | 0.774 | 0.470 | 0.622 | 29/29 | Moderate† |
| Low Performance | | | | | |
| miracl-hi | 0.538 | 0.158 | 0.348 | 36/36 | Poor† |
| miracl-bn | 0.222 | 0.212 | 0.217 | 23/23 | Poor† |
| miracl-th | 0.239 | 0.516 | 0.378 | 56/56 | Poor† |
| miracl-te | 0.067 | 0.150 | 0.109 | 61/61 | Poor† |
| Special Cases | | | | | |
| auto-wiki-qa-nemotron | 0.571 | 0.912* | 0.742 | 41/41 | *Large split difference |
Note: We show both test and validation scores because some datasets have small test sets, making validation scores more reliable. Large differences between splits (>0.1) may indicate data quality issues.
† Important: All MIRACL datasets have fewer than 100 test samples, with many having fewer than 50. During dataset creation, we should have allocated more samples to the test sets for MIRACL languages. The small sample sizes raise concerns about the statistical reliability of the reported metrics. Performance on these datasets should be interpreted with caution, as the confidence intervals are likely to be wide.
Model Selection Guide
Choose 4B (Recommended) if you need:
- Optimal balance of accuracy and speed
- Production deployment
- F1 > 0.80 on major languages
- Cost-effective inference
- Best performance per computational cost
Choose 1.7B if you need:
- Maximum speed for real-time applications
- Resource-constrained environments
- Acceptable accuracy (F1 > 0.75 is sufficient)
- Edge deployment scenarios
Training Details
- Dataset: qa-context-relevance-multilingual-140k
  - 142,629 samples across 20 languages
  - English: MS-MARCO dataset for robust English QA performance
  - Japanese: Various Japanese QA datasets (JaQuAD, JSQuAD, JQARA, Quiz datasets) for comprehensive coverage
  - Multilingual: MIRACL dataset covering 18 languages for broad multilingual support
  - Important Note: Relevance labels were generated using DeepSeek-V3-0324, an LLM, without human verification. Therefore, the ground truth data itself may contain errors and should not be considered perfect annotations.
- Base Model: Qwen-3 series
- Method: Supervised Fine-Tuning (SFT); a data-formatting sketch follows this list
- Training Strategy: Multi-language curriculum learning
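For reference, a single SFT example for this task can be represented as a chat pair whose user turn is the query/context prompt and whose assistant turn is the citation-style label. The sketch below is not the actual training script: the dataset field names (`query`, `contexts`, `relevant_indices`) are assumptions, and it reuses `create_user_prompt` from the Quick Start.

```python
def to_sft_example(sample: dict) -> dict:
    """Convert one record into a chat-style SFT example (field names are assumed)."""
    user_prompt = create_user_prompt(sample["query"], sample["contexts"])
    # Target is the citation-style label, e.g. "2,3,5", or "No answer" when nothing is relevant
    indices = sample["relevant_indices"]
    target = ",".join(str(i) for i in indices) if indices else "No answer"
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": target},
        ]
    }
```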
Technical Implementation
The model implements context pruning as described in:
Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation
Chirkova et al., 2025 (arXiv:2501.16214)
Key features:
- Citation-style output format (e.g., "1,3" for chunks 1 and 3); a parsing sketch follows this list
- Handles multiple relevant chunks
- Language-agnostic prompt format
- Optimized for chunk-level relevance
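Because the output is either a comma-separated list of 1-based chunk numbers or the literal string "No answer", downstream code should parse it defensively. The parser below is an illustrative sketch, not part of the released code:

```python
def parse_citations(response: str, num_chunks: int) -> list[int]:
    """Parse model output such as "1,3" or "No answer" into validated 1-based chunk numbers."""
    text = response.strip()
    if not text or text.lower().startswith("no answer"):
        return []
    indices = set()
    for part in text.split(","):
        part = part.strip()
        # Drop malformed or out-of-range items rather than failing
        if part.isdigit() and 1 <= int(part) <= num_chunks:
            indices.add(int(part))
    return sorted(indices)

assert parse_citations("1,3", num_chunks=5) == [1, 3]
assert parse_citations("No answer", num_chunks=5) == []
```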
License
Apache-2.0
Author
Yuichi Tateno (@hotchpotch)
Appendix
A. Dataset Creation Process
The training dataset was created through:
- Collection of diverse QA datasets:
  - MS-MARCO for English
  - Japanese QA datasets (JaQuAD, JSQuAD, JQARA, MKQA-ja, Mr.TyDi-ja, Quiz-no-mori, Quiz-works)
  - MIRACL for 18 additional languages
- Text chunking using language-specific tools
- Relevance annotation using DeepSeek-V3-0324 (LLM-based annotation)
- Quality filtering and balancing
Note: All relevance annotations were generated by an LLM without human verification, which may introduce annotation errors.
B. Model Architecture
- Base: Qwen-3 transformer architecture
- Context Length: Standard context window
- Special Tokens: Standard Qwen-3 chat template
- Output Format: Citation numbers or "No answer"
C. Evaluation Methodology
We evaluated on both test and validation sets because:
- Some datasets have small test sets (<100 samples)
- Validation scores provide more stable estimates
- Combined scores reduce evaluation noise
- Split differences help identify data quality issues
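For clarity, the F1 and exact-match numbers above can be read as set metrics over predicted vs. gold relevant-chunk indices. The helper below is an illustrative reconstruction of that computation under this assumption, not the exact evaluation script:

```python
def chunk_f1_em(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Set-level F1 and exact match between predicted and gold relevant-chunk indices."""
    em = float(predicted == gold)
    if not predicted and not gold:
        return 1.0, em  # both sides say "No answer"
    if not predicted or not gold:
        return 0.0, em
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return f1, em

print(chunk_f1_em({2, 3, 5}, {2, 3}))  # (0.8, 0.0): one spurious chunk hurts precision but not recall
```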
D. Performance Comparison Across Model Sizes
D.1 Japanese Datasets Performance
| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-ja | | | | | | | | |
| | 0.6B | 0.645 | 0.641 | 0.650 | 168 | 168 | 0.548 | 0.524 |
| | 1.7B | 0.819 | 0.834 | 0.803 | 168 | 168 | 0.702 | 0.655 |
| | 4B | 0.835 | 0.852 | 0.819 | 168 | 168 | 0.750 | 0.696 |
| | 8B | 0.859 | 0.890 | 0.828 | 168 | 168 | 0.780 | 0.714 |
| jaquad | | | | | | | | |
| | 0.6B | 0.926 | 0.948 | 0.904 | 56 | 56 | 0.911 | 0.839 |
| | 1.7B | 0.972 | 0.978 | 0.966 | 56 | 56 | 0.929 | 0.911 |
| | 4B | 0.970 | 0.994 | 0.946 | 56 | 56 | 0.982 | 0.893 |
| | 8B | 0.988 | 1.000 | 0.975 | 56 | 56 | 1.000 | 0.946 |
| mr-tydi-ja | | | | | | | | |
| | 0.6B | 0.830 | 0.850 | 0.810 | 55 | 55 | 0.818 | 0.691 |
| | 1.7B | 0.867 | 0.892 | 0.842 | 55 | 55 | 0.818 | 0.691 |
| | 4B | 0.915 | 0.921 | 0.908 | 55 | 55 | 0.873 | 0.782 |
| | 8B | 0.880 | 0.870 | 0.891 | 55 | 55 | 0.818 | 0.764 |
| jsquad | | | | | | | | |
| | 0.6B | 0.919 | 0.903 | 0.934 | 57 | 57 | 0.842 | 0.877 |
| | 1.7B | 0.945 | 0.945 | 0.946 | 57 | 57 | 0.895 | 0.877 |
| | 4B | 0.950 | 0.939 | 0.960 | 57 | 57 | 0.877 | 0.912 |
| | 8B | 0.951 | 0.965 | 0.937 | 57 | 57 | 0.930 | 0.877 |
| quiz-works | | | | | | | | |
| | 0.6B | 0.870 | 0.885 | 0.855 | 55 | 55 | 0.745 | 0.655 |
| | 1.7B | 0.905 | 0.936 | 0.873 | 55 | 55 | 0.800 | 0.691 |
| | 4B | 0.898 | 0.933 | 0.862 | 55 | 55 | 0.836 | 0.655 |
| | 8B | 0.927 | 0.940 | 0.915 | 55 | 55 | 0.855 | 0.782 |
D.2 English Datasets Performance
| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-en | | | | | | | | |
| | 0.6B | 0.765 | 0.771 | 0.759 | 630 | 630 | 0.657 | 0.646 |
| | 1.7B | 0.853 | 0.859 | 0.847 | 630 | 630 | 0.737 | 0.722 |
| | 4B | 0.883 | 0.889 | 0.877 | 630 | 630 | 0.784 | 0.775 |
| | 8B | 0.901 | 0.904 | 0.897 | 630 | 630 | 0.798 | 0.798 |
| miracl-en | | | | | | | | |
| | 0.6B | 0.714 | 0.715 | 0.713 | 50 | 50 | 0.580 | 0.640 |
| | 1.7B | 0.791 | 0.785 | 0.797 | 50 | 50 | 0.600 | 0.660 |
| | 4B | 0.830 | 0.871 | 0.789 | 50 | 50 | 0.760 | 0.660 |
| | 8B | 0.832 | 0.834 | 0.830 | 50 | 50 | 0.720 | 0.700 |
D.3 MIRACL Languages Performance (Selected High/Low Performers)
| Language | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| Arabic (ar) - High Performer | | | | | | | | |
| | 0.6B | 0.791 | 0.772 | 0.811 | 61 | 61 | 0.656 | 0.656 |
| | 1.7B | 0.876 | 0.880 | 0.872 | 61 | 61 | 0.738 | 0.721 |
| | 4B | 0.897 | 0.917 | 0.877 | 61 | 61 | 0.770 | 0.754 |
| | 8B | 0.903 | 0.904 | 0.902 | 61 | 61 | 0.803 | 0.803 |
| Russian (ru) - High Performer | | | | | | | | |
| | 0.6B | 0.767 | 0.786 | 0.749 | 82 | 82 | 0.634 | 0.622 |
| | 1.7B | 0.869 | 0.884 | 0.854 | 82 | 82 | 0.780 | 0.732 |
| | 4B | 0.895 | 0.916 | 0.874 | 82 | 82 | 0.841 | 0.768 |
| | 8B | 0.879 | 0.889 | 0.869 | 82 | 82 | 0.817 | 0.768 |
| Finnish (fi) - High Performer | | | | | | | | |
| | 0.6B | 0.666 | 0.676 | 0.655 | 50 | 50 | 0.600 | 0.480 |
| | 1.7B | 0.822 | 0.840 | 0.805 | 50 | 50 | 0.600 | 0.620 |
| | 4B | 0.874 | 0.868 | 0.881 | 50 | 50 | 0.740 | 0.680 |
| | 8B | 0.892 | 0.882 | 0.901 | 50 | 50 | 0.800 | 0.740 |
| Hindi (hi) - Low Performer | | | | | | | | |
| | 0.6B | 0.458 | 0.447 | 0.469 | 29 | 29 | 0.379 | 0.379 |
| | 1.7B | 0.455 | 0.476 | 0.435 | 29 | 29 | 0.448 | 0.379 |
| | 4B | 0.495 | 0.538 | 0.451 | 29 | 29 | 0.483 | 0.379 |
| | 8B | 0.431 | 0.476 | 0.387 | 29 | 29 | 0.448 | 0.345 |
| Telugu (te) - Lowest Performer | | | | | | | | |
| | 0.6B | 0.206 | 0.261 | 0.150 | 60 | 60 | 0.250 | 0.150 |
| | 1.7B | 0.218 | 0.286 | 0.150 | 60 | 60 | 0.267 | 0.150 |
| | 4B | 0.228 | 0.306 | 0.150 | 60 | 60 | 0.283 | 0.150 |
| | 8B | 0.233 | 0.317 | 0.150 | 60 | 60 | 0.317 | 0.150 |
Key Observations:
- Diminishing Returns: The performance gain from 4B to 8B is minimal (avg 1-3% improvement) while computational cost increases significantly
- Language-Specific Patterns: Some languages (e.g., Telugu, Hindi) show poor performance across all model sizes, indicating data quality issues
- 4B Sweet Spot: The 4B model achieves 95-98% of 8B performance at 60% of the computational cost
- Small Sample Impact: Languages with <50 test samples show high variance between test/validation scores