query-context-pruner-multilingual
A multilingual model for removing contexts that are irrelevant to a query. Because it is a relatively large LLM, it is too slow for speed-critical information retrieval pipelines, where lightweight pruning methods such as Provence are needed instead; it is, however, particularly well suited to generating teacher labels for training such efficient pruning models.
We offer two model sizes: a 4B model for accuracy-focused label generation where precision matters more than speed, and a 1.7B model for high-speed label generation.
Why This Model is Valuable
As described in the Provence paper, context pruning addresses critical challenges in RAG systems:
- Context Noise Problem: Retrieved documents often contain irrelevant sentences that mislead LLMs and degrade response quality
- Computational Overhead: Long contexts slow down generation and increase costs significantly
- Performance Degradation: Irrelevant information leads to hallucinations and poor answer quality
Our model enables efficient training data generation for lightweight context pruners that can:
- Reduce context length by 70-90% while maintaining accuracy
- Achieve almost zero-cost pruning when unified with reranking (as shown in Provence)
- Work out-of-the-box across diverse domains and languages
- Dynamically detect optimal pruning ratios (0-100%) per context
This addresses the practical need for robust, adaptable context pruners that can be deployed in production RAG systems without compromising performance.
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose model size
MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-4B"  # Recommended (best balance)
# MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-1.7B"  # Faster alternative

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def create_user_prompt(query: str, contexts: list[str]) -> str:
    """Create a formatted prompt from the query and numbered context chunks."""
    context_str = "\n".join([f"[{i+1}] {ctx}" for i, ctx in enumerate(contexts)])
    return f"{query}\n---\n{context_str}"


# Example: Python documentation split into chunks
query = "How do you read and write files in Python?"
contexts = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "To read a file in Python, you use the open() function with the 'r' mode. For example: with open('file.txt', 'r') as f: content = f.read()",
    "Writing to files in Python also uses open() with 'w' mode for writing or 'a' mode for appending. Example: with open('file.txt', 'w') as f: f.write('Hello')",
    "Python supports multiple programming paradigms including object-oriented and functional programming.",
    "The 'with' statement ensures proper file handling by automatically closing files after use, preventing resource leaks.",
    "Python has various built-in functions like len(), range(), and type() that are commonly used in everyday programming.",
]

# Generate response
prompt = create_user_prompt(query, contexts)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(f"Relevant chunks: {response}")  # Expected: "2,3,5" (reading, writing, and proper file handling)
# Note: The model's output indices start from 1, not 0
```
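To actually prune, the 1-based citation numbers in `response` can be mapped back onto the 0-based `contexts` list. The continuation below is only illustrative and not part of the original snippet; a more defensive parser that also handles the "No answer" output is sketched in the Technical Implementation section.

```python
# Map the model's 1-based citation numbers back to the 0-based Python list
selected = [int(i) - 1 for i in response.split(",")]
pruned_context = "\n".join(contexts[i] for i in selected)
print(pruned_context)  # keeps only the chunks about reading, writing, and the 'with' statement
```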
Model Description
Query-Context Pruner is designed to solve the context overload problem in RAG systems by identifying which text chunks contain information relevant to answering a query. This enables:
- Efficient RAG pipelines: Reduce context length by 70-90% while maintaining accuracy
- Training data generation: Create high-quality labels for smaller bi-encoder models (see the labeling sketch after this list)
- Multilingual information retrieval: Works across 20 languages with varying performance
- Context compression: Eliminate irrelevant information before passing to LLMs
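As a concrete illustration of the teacher-labeling workflow referenced above, the loop below reuses `model`, `tokenizer`, and `create_user_prompt` from the Quick Start to label a batch of (query, contexts) pairs and writes the raw citation strings to a JSONL file. This is only a sketch; the `label_batch` helper and the output field names are not part of the released code.

```python
import json

def label_batch(samples: list[dict]) -> list[dict]:
    """Generate teacher labels (relevant chunk indices) for (query, contexts) pairs."""
    labeled = []
    for sample in samples:
        prompt = create_user_prompt(sample["query"], sample["contexts"])
        messages = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        # Store the raw 1-based indices as emitted by the model, e.g. "2,3,5" or "No answer"
        labeled.append({**sample, "relevant_chunks": response.strip()})
    return labeled

# Write one JSON object per line for downstream training of a lightweight pruner
with open("teacher_labels.jsonl", "w", encoding="utf-8") as f:
    for row in label_batch([{"query": query, "contexts": contexts}]):
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```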
Supported Languages
The model supports 20 languages with varying performance levels. Important: While some languages show excellent performance, others have limited effectiveness. Please carefully review your target language's performance before use.
Note on Evaluation: Many test datasets (especially MIRACL) have very small sample sizes, making test scores less reliable. We report averaged F1 scores (test + validation) for more robust evaluation.
Language Performance Summary (4B Model)
| Language | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Performance Tier |
|---|---|---|---|---|---|---|
| High Performance Languages (F1 > 0.85) | | | | | | |
| Russian (ru) | 0.901 | 0.910 | 0.891 | 82 | 82 | Excellent |
| Arabic (ar) | 0.899 | 0.916 | 0.882 | 61 | 61 | Excellent |
| Finnish (fi) | 0.877 | 0.874 | 0.880 | 48 | 48 | Excellent |
| Indonesian (id) | 0.871 | 0.858 | 0.883 | 64 | 64 | Excellent |
| Japanese (ja)* | 0.865 | 0.860 | 0.870 | 478 | 478 | Excellent |
| Good Performance Languages (F1 0.75-0.85) | | | | | | |
| English (en)* | 0.847 | 0.874 | 0.819 | 682 | 682 | Very Good |
| Korean (ko) | 0.849 | 0.910 | 0.787 | 30 | 30 | Very Good† |
| German (de) | 0.843 | 0.776 | 0.909 | 14 | 14 | Very Good† |
| Persian (fa) | 0.818 | 0.739 | 0.896 | 27 | 27 | Very Good† |
| Swahili (sw) | 0.806 | 0.767 | 0.845 | 37 | 37 | Very Good |
| Chinese (zh) | 0.805 | 0.777 | 0.833 | 38 | 38 | Good |
| Moderate Performance Languages (F1 0.60-0.75) | | | | | | |
| Spanish (es) | 0.715 | 0.800 | 0.629 | 31 | 31 | Moderate‡ |
| Yoruba (yo) | 0.662 | 0.630 | 0.694 | 18 | 18 | Moderate† |
| French (fr) | 0.622 | 0.774 | 0.470 | 29 | 29 | Moderate‡ |
| Limited Performance Languages (F1 < 0.60) | | | | | | |
| Thai (th) | 0.378 | 0.239 | 0.516 | 56 | 56 | Poor |
| Hindi (hi) | 0.348 | 0.538 | 0.158 | 36 | 36 | Poor‡ |
| Bengali (bn) | 0.217 | 0.222 | 0.212 | 23 | 23 | Very Poor |
| Telugu (te) | 0.109 | 0.067 | 0.150 | 61 | 61 | Very Poor |
* Average across multiple datasets (Japanese: 8 datasets, English: MS-MARCO + MIRACL)
† Very small test set - interpret with caution
‡ High variance between test/validation - less stable performance
Note: Performance varies by domain. Languages with more training data generally perform better.
Model Evaluation
We evaluated four model sizes on 28 datasets covering multiple languages and domains. Results shown are combined test+validation scores for robustness.
Model Size Comparison
| Model | Avg F1 | Avg Exact Match | Relative Speed | Recommendation |
|---|---|---|---|---|
| 4B | 0.820 | 0.712 | 1.0x | Recommended |
| 1.7B | 0.794 | 0.666 | 1.5x | High-speed option |
| 8B | 0.827 | 0.727 | 0.6x | Maximum accuracy (not released) |
| 0.6B | 0.707 | 0.600 | 2.5x | Baseline (not released) |
Performance Highlights
Why 4B Model is Recommended:
- Only a ~0.9% relative F1 gap vs. the 8B model (0.820 vs 0.827)
- 67% faster inference than 8B model
- Perfect balance of accuracy and efficiency
- Optimal for production deployments
Detailed Performance by Dataset (4B Model)
| Dataset | Test F1 | Val F1 | Combined F1 | Test/Val Samples | Notes |
|---|---|---|---|---|---|
| Japanese Datasets | | | | | |
| jaquad | 0.994 | 0.952 | 0.973 | 47/47 | Excellent |
| mr-tydi-ja | 0.939 | 0.791 | 0.865 | 55/55 | Very good |
| jsquad | 0.942 | 0.787 | 0.865 | 53/53 | Very good |
| jqara | 0.906 | 0.888 | 0.897 | 54/54 | Excellent |
| mkqa-ja | 0.679 | 0.823 | 0.751 | 55/55 | Good |
| quiz-works | 0.933 | 0.869 | 0.901 | 51/51 | Excellent |
| quiz-no-mori | 0.607 | 0.879 | 0.743 | 52/52 | Good |
| ms-marco-ja | 0.846 | 0.821 | 0.834 | 168/168 | Very good |
| English Datasets | | | | | |
| ms-marco-en | 0.888 | 0.869 | 0.879 | 630/630 | Very good |
| miracl-en | 0.860 | 0.769 | 0.815 | 52/52 | Very good† |
| MIRACL Languages | | | | | |
| miracl-ar | 0.916 | 0.882 | 0.899 | 61/61 | Excellent† |
| miracl-ru | 0.910 | 0.891 | 0.901 | 82/82 | Excellent† |
| miracl-fi | 0.874 | 0.880 | 0.877 | 48/48 | Very good† |
| miracl-id | 0.858 | 0.883 | 0.871 | 64/64 | Very good† |
| miracl-ko | 0.910 | 0.787 | 0.849 | 30/30 | Very good† |
| miracl-fa | 0.739 | 0.896 | 0.818 | 27/27 | Very good† |
| miracl-de | 0.776 | 0.909 | 0.843 | 14/14 | Very good† |
| miracl-sw | 0.767 | 0.845 | 0.806 | 37/37 | Very good† |
| miracl-zh | 0.777 | 0.833 | 0.805 | 38/38 | Very good† |
| miracl-es | 0.800 | 0.629 | 0.715 | 31/31 | Good† |
| miracl-yo | 0.630 | 0.694 | 0.662 | 18/18 | Moderate† |
| miracl-fr | 0.774 | 0.470 | 0.622 | 29/29 | Moderate† |
| Low Performance | | | | | |
| miracl-hi | 0.538 | 0.158 | 0.348 | 36/36 | Poor† |
| miracl-bn | 0.222 | 0.212 | 0.217 | 23/23 | Poor† |
| miracl-th | 0.239 | 0.516 | 0.378 | 56/56 | Poor† |
| miracl-te | 0.067 | 0.150 | 0.109 | 61/61 | Poor† |
| Special Cases | | | | | |
| auto-wiki-qa-nemotron | 0.571 | 0.912* | 0.742 | 41/41 | *Large split difference |
Note: We show both test and validation scores because some datasets have small test sets, making validation scores more reliable. Large differences between splits (>0.1) may indicate data quality issues.
† Important: All MIRACL datasets have fewer than 100 test samples, with many having fewer than 50. During dataset creation, we should have allocated more samples to the test sets for MIRACL languages. The small sample sizes raise concerns about the statistical reliability of the reported metrics. Performance on these datasets should be interpreted with caution, as the confidence intervals are likely to be wide.
Model Selection Guide
Choose 4B (Recommended) if you need:
- Optimal balance of accuracy and speed
- Production deployment
- F1 > 0.80 on major languages
- Cost-effective inference
- Best performance per computational cost
Choose 1.7B if you need:
- Maximum speed for real-time applications
- Resource-constrained environments
- Acceptable accuracy (F1 > 0.75 is sufficient)
- Edge deployment scenarios
Training Details
- Dataset: qa-context-relevance-multilingual-140k
  - 142,629 samples across 20 languages
  - English: MS-MARCO dataset for robust English QA performance
  - Japanese: Various Japanese QA datasets (JaQuAD, JSQuAD, JQARA, Quiz datasets) for comprehensive coverage
  - Multilingual: MIRACL dataset covering 18 languages for broad multilingual support
  - Important Note: Relevance labels were generated using DeepSeek-V3-0324, an LLM, without human verification. Therefore, the ground truth data itself may contain errors and should not be considered perfect annotations.
- Base Model: Qwen-3 series
- Method: Supervised Fine-Tuning (SFT); a data-formatting sketch follows this list
- Training Strategy: Multi-language curriculum learning
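For reference, a single SFT example for this task can be represented as a chat pair whose user turn is the query/context prompt and whose assistant turn is the citation-style label. The sketch below is not the actual training script: the dataset field names (`query`, `contexts`, `relevant_indices`) are assumptions, and it reuses `create_user_prompt` from the Quick Start.

```python
def to_sft_example(sample: dict) -> dict:
    """Convert one record into a chat-style SFT example (field names are assumed)."""
    user_prompt = create_user_prompt(sample["query"], sample["contexts"])
    # Target is the citation-style label, e.g. "2,3,5", or "No answer" when nothing is relevant
    indices = sample["relevant_indices"]
    target = ",".join(str(i) for i in indices) if indices else "No answer"
    return {
        "messages": [
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": target},
        ]
    }
```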
Technical Implementation
The model implements context pruning as described in:
Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation
Chirkova et al., 2025 (arXiv:2501.16214)
Key features:
- Citation-style output format (e.g., "1,3" for chunks 1 and 3); a parsing sketch follows this list
- Handles multiple relevant chunks
- Language-agnostic prompt format
- Optimized for chunk-level relevance
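Because the output is either a comma-separated list of 1-based chunk numbers or the literal string "No answer", downstream code should parse it defensively. The parser below is an illustrative sketch, not part of the released code:

```python
def parse_citations(response: str, num_chunks: int) -> list[int]:
    """Parse model output such as "1,3" or "No answer" into validated 1-based chunk numbers."""
    text = response.strip()
    if not text or text.lower().startswith("no answer"):
        return []
    indices = set()
    for part in text.split(","):
        part = part.strip()
        # Drop malformed or out-of-range items rather than failing
        if part.isdigit() and 1 <= int(part) <= num_chunks:
            indices.add(int(part))
    return sorted(indices)

assert parse_citations("1,3", num_chunks=5) == [1, 3]
assert parse_citations("No answer", num_chunks=5) == []
```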
License
Apache-2.0
Author
Yuichi Tateno (@hotchpotch)
Appendix
A. Dataset Creation Process
The training dataset was created through:
- Collection of diverse QA datasets:
  - MS-MARCO for English
  - Japanese QA datasets (JaQuAD, JSQuAD, JQARA, MKQA-ja, Mr.TyDi-ja, Quiz-no-mori, Quiz-works)
  - MIRACL for 18 additional languages
- Text chunking using language-specific tools
- Relevance annotation using DeepSeek-V3-0324 (LLM-based annotation)
- Quality filtering and balancing
Note: All relevance annotations were generated by an LLM without human verification, which may introduce annotation errors.
B. Model Architecture
- Base: Qwen-3 transformer architecture
- Context Length: Standard context window
- Special Tokens: Standard Qwen-3 chat template
- Output Format: Citation numbers or "No answer"
C. Evaluation Methodology
We evaluated on both test and validation sets because:
- Some datasets have small test sets (<100 samples)
- Validation scores provide more stable estimates
- Combined scores reduce evaluation noise
- Split differences help identify data quality issues
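For clarity, the F1 and exact-match numbers above can be read as set metrics over predicted vs. gold relevant-chunk indices. The helper below is an illustrative reconstruction of that computation under this assumption, not the exact evaluation script:

```python
def chunk_f1_em(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Set-level F1 and exact match between predicted and gold relevant-chunk indices."""
    em = float(predicted == gold)
    if not predicted and not gold:
        return 1.0, em  # both sides say "No answer"
    if not predicted or not gold:
        return 0.0, em
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return f1, em

print(chunk_f1_em({2, 3, 5}, {2, 3}))  # (0.8, 0.0): one spurious chunk hurts precision but not recall
```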
D. Performance Comparison Across Model Sizes
D.1 Japanese Datasets Performance
| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-ja | | | | | | | | |
| | 0.6B | 0.645 | 0.641 | 0.650 | 168 | 168 | 0.548 | 0.524 |
| | 1.7B | 0.819 | 0.834 | 0.803 | 168 | 168 | 0.702 | 0.655 |
| | 4B | 0.835 | 0.852 | 0.819 | 168 | 168 | 0.750 | 0.696 |
| | 8B | 0.859 | 0.890 | 0.828 | 168 | 168 | 0.780 | 0.714 |
| jaquad | | | | | | | | |
| | 0.6B | 0.926 | 0.948 | 0.904 | 56 | 56 | 0.911 | 0.839 |
| | 1.7B | 0.972 | 0.978 | 0.966 | 56 | 56 | 0.929 | 0.911 |
| | 4B | 0.970 | 0.994 | 0.946 | 56 | 56 | 0.982 | 0.893 |
| | 8B | 0.988 | 1.000 | 0.975 | 56 | 56 | 1.000 | 0.946 |
| mr-tydi-ja | | | | | | | | |
| | 0.6B | 0.830 | 0.850 | 0.810 | 55 | 55 | 0.818 | 0.691 |
| | 1.7B | 0.867 | 0.892 | 0.842 | 55 | 55 | 0.818 | 0.691 |
| | 4B | 0.915 | 0.921 | 0.908 | 55 | 55 | 0.873 | 0.782 |
| | 8B | 0.880 | 0.870 | 0.891 | 55 | 55 | 0.818 | 0.764 |
| jsquad | | | | | | | | |
| | 0.6B | 0.919 | 0.903 | 0.934 | 57 | 57 | 0.842 | 0.877 |
| | 1.7B | 0.945 | 0.945 | 0.946 | 57 | 57 | 0.895 | 0.877 |
| | 4B | 0.950 | 0.939 | 0.960 | 57 | 57 | 0.877 | 0.912 |
| | 8B | 0.951 | 0.965 | 0.937 | 57 | 57 | 0.930 | 0.877 |
| quiz-works | | | | | | | | |
| | 0.6B | 0.870 | 0.885 | 0.855 | 55 | 55 | 0.745 | 0.655 |
| | 1.7B | 0.905 | 0.936 | 0.873 | 55 | 55 | 0.800 | 0.691 |
| | 4B | 0.898 | 0.933 | 0.862 | 55 | 55 | 0.836 | 0.655 |
| | 8B | 0.927 | 0.940 | 0.915 | 55 | 55 | 0.855 | 0.782 |
D.2 English Datasets Performance
| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-en | | | | | | | | |
| | 0.6B | 0.765 | 0.771 | 0.759 | 630 | 630 | 0.657 | 0.646 |
| | 1.7B | 0.853 | 0.859 | 0.847 | 630 | 630 | 0.737 | 0.722 |
| | 4B | 0.883 | 0.889 | 0.877 | 630 | 630 | 0.784 | 0.775 |
| | 8B | 0.901 | 0.904 | 0.897 | 630 | 630 | 0.798 | 0.798 |
| miracl-en | | | | | | | | |
| | 0.6B | 0.714 | 0.715 | 0.713 | 50 | 50 | 0.580 | 0.640 |
| | 1.7B | 0.791 | 0.785 | 0.797 | 50 | 50 | 0.600 | 0.660 |
| | 4B | 0.830 | 0.871 | 0.789 | 50 | 50 | 0.760 | 0.660 |
| | 8B | 0.832 | 0.834 | 0.830 | 50 | 50 | 0.720 | 0.700 |
D.3 MIRACL Languages Performance (Selected High/Low Performers)
| Language | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| Arabic (ar) - High Performer | | | | | | | | |
| | 0.6B | 0.791 | 0.772 | 0.811 | 61 | 61 | 0.656 | 0.656 |
| | 1.7B | 0.876 | 0.880 | 0.872 | 61 | 61 | 0.738 | 0.721 |
| | 4B | 0.897 | 0.917 | 0.877 | 61 | 61 | 0.770 | 0.754 |
| | 8B | 0.903 | 0.904 | 0.902 | 61 | 61 | 0.803 | 0.803 |
| Russian (ru) - High Performer | | | | | | | | |
| | 0.6B | 0.767 | 0.786 | 0.749 | 82 | 82 | 0.634 | 0.622 |
| | 1.7B | 0.869 | 0.884 | 0.854 | 82 | 82 | 0.780 | 0.732 |
| | 4B | 0.895 | 0.916 | 0.874 | 82 | 82 | 0.841 | 0.768 |
| | 8B | 0.879 | 0.889 | 0.869 | 82 | 82 | 0.817 | 0.768 |
| Finnish (fi) - High Performer | | | | | | | | |
| | 0.6B | 0.666 | 0.676 | 0.655 | 50 | 50 | 0.600 | 0.480 |
| | 1.7B | 0.822 | 0.840 | 0.805 | 50 | 50 | 0.600 | 0.620 |
| | 4B | 0.874 | 0.868 | 0.881 | 50 | 50 | 0.740 | 0.680 |
| | 8B | 0.892 | 0.882 | 0.901 | 50 | 50 | 0.800 | 0.740 |
| Hindi (hi) - Low Performer | | | | | | | | |
| | 0.6B | 0.458 | 0.447 | 0.469 | 29 | 29 | 0.379 | 0.379 |
| | 1.7B | 0.455 | 0.476 | 0.435 | 29 | 29 | 0.448 | 0.379 |
| | 4B | 0.495 | 0.538 | 0.451 | 29 | 29 | 0.483 | 0.379 |
| | 8B | 0.431 | 0.476 | 0.387 | 29 | 29 | 0.448 | 0.345 |
| Telugu (te) - Lowest Performer | | | | | | | | |
| | 0.6B | 0.206 | 0.261 | 0.150 | 60 | 60 | 0.250 | 0.150 |
| | 1.7B | 0.218 | 0.286 | 0.150 | 60 | 60 | 0.267 | 0.150 |
| | 4B | 0.228 | 0.306 | 0.150 | 60 | 60 | 0.283 | 0.150 |
| | 8B | 0.233 | 0.317 | 0.150 | 60 | 60 | 0.317 | 0.150 |
Key Observations:
- Diminishing Returns: The performance gain from 4B to 8B is minimal (avg 1-3% improvement) while computational cost increases significantly
- Language-Specific Patterns: Some languages (e.g., Telugu, Hindi) show poor performance across all model sizes, indicating data quality issues
- 4B Sweet Spot: The 4B model achieves 95-98% of 8B performance at 60% of the computational cost
- Small Sample Impact: Languages with <50 test samples show high variance between test/validation scores