
query-context-pruner-multilingual

A multilingual model that identifies and removes context passages irrelevant to a query. Because it is itself a relatively large LLM, it is too slow for speed-critical information retrieval, where lightweight pruning methods such as Provence are the practical choice; instead, this model is particularly well suited to generating teacher labels for training such efficient pruning models.

We offer two model sizes: a 4B model for accuracy-focused label generation where precision matters more than speed, and a 1.7B model for high-speed label generation.

Why This Model is Valuable

Based on the Provence paper, context pruning addresses critical challenges in RAG systems:

  • Context Noise Problem: Retrieved documents often contain irrelevant sentences that mislead LLMs and degrade response quality
  • Computational Overhead: Long contexts slow down generation and increase costs significantly
  • Performance Degradation: Irrelevant information leads to hallucinations and poor answer quality

Our model enables efficient training data generation for lightweight context pruners that can:

  • Reduce context length by 70-90% while maintaining accuracy
  • Achieve almost zero-cost pruning when unified with reranking (as shown in Provence)
  • Work out-of-the-box across diverse domains and languages
  • Dynamically detect optimal pruning ratios (0-100%) per context

This addresses the practical need for robust, adaptable context pruners that can be deployed in production RAG systems without compromising performance.

🚀 Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose model size
MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-4B"   # Recommended (best balance)
# MODEL_NAME = "hotchpotch/query-context-pruner-multilingual-Qwen3-1.7B"  # Faster alternative

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def create_user_prompt(query: str, contexts: list[str]) -> str:
    """Create a formatted prompt from query and context chunks."""
    context_str = "\n".join([f"[{i+1}] {ctx}" for i, ctx in enumerate(contexts)])
    return f"{query}\n---\n{context_str}"


# Example: Python documentation split into chunks
query = "How do you read and write files in Python?"
contexts = [
    "Python is a high-level programming language known for its simplicity and readability.",
    "To read a file in Python, you use the open() function with the 'r' mode. For example: with open('file.txt', 'r') as f: content = f.read()",
    "Writing to files in Python also uses open() with 'w' mode for writing or 'a' mode for appending. Example: with open('file.txt', 'w') as f: f.write('Hello')",
    "Python supports multiple programming paradigms including object-oriented and functional programming.",
    "The 'with' statement ensures proper file handling by automatically closing files after use, preventing resource leaks.",
    "Python has various built-in functions like len(), range(), and type() that are commonly used in everyday programming."
]

# Generate response
prompt = create_user_prompt(query, contexts)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(f"Relevant chunks: {response}")  # Expected: "2,3,5" (reading, writing, and proper file handling)

# Note: The model's output indices start from 1, not 0
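
The note above matters in practice: the raw output is a comma-separated string of 1-based chunk numbers, or "No answer" when nothing is relevant, so a small parsing step is needed before the result can drive pruning. Below is a minimal sketch; the helper name and the defensive handling of whitespace and out-of-range citations are our own additions, not part of the model's API.

def parse_relevant_chunks(response: str, num_contexts: int) -> list[int]:
    """Convert the model's citation-style output (1-based) into 0-based context indices."""
    if response.strip().lower().startswith("no answer"):
        return []
    indices = set()
    for part in response.split(","):
        part = part.strip()
        if part.isdigit():
            idx = int(part) - 1  # the model counts chunks from 1
            if 0 <= idx < num_contexts:  # ignore citations that are out of range
                indices.add(idx)
    return sorted(indices)


relevant = parse_relevant_chunks(response, len(contexts))
pruned_context = "\n".join(contexts[i] for i in relevant)
print(pruned_context)  # keeps only the cited chunks (here: the file-handling sentences)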

📋 Model Description

Query-Context Pruner is designed to solve the context overload problem in RAG systems by identifying which text chunks contain information relevant to answering a query. This enables:

  • Efficient RAG pipelines: Reduce context length by 70-90% while maintaining accuracy
  • Training data generation: Create high-quality labels for smaller bi-encoder models (see the sketch after this list)
  • Multilingual information retrieval: Works across 20 languages with varying performance
  • Context compression: Eliminate irrelevant information before passing to LLMs
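
As a concrete illustration of the training-data-generation use case, the loop below turns the model's citations into chunk-level binary labels that a smaller bi-encoder pruner could be trained on. It is a sketch that reuses create_user_prompt, parse_relevant_chunks, model, and tokenizer from the Quick Start; the record layout (query / chunk / label) is an illustrative assumption, not a prescribed schema.

def generate_teacher_labels(query: str, contexts: list[str]) -> list[dict]:
    """Label each chunk 1 (relevant) or 0 (irrelevant) using the pruner's citations."""
    prompt = create_user_prompt(query, contexts)
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    relevant = set(parse_relevant_chunks(response, len(contexts)))
    # One binary-labeled record per chunk, usable as teacher labels for a lightweight pruner.
    return [
        {"query": query, "chunk": ctx, "label": int(i in relevant)}
        for i, ctx in enumerate(contexts)
    ]


teacher_labels = generate_teacher_labels(query, contexts)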

🌍 Supported Languages

The model supports 20 languages with varying performance levels. Important: While some languages show excellent performance, others have limited effectiveness. Please carefully review your target language's performance before use.

Note on Evaluation: Many test datasets (especially MIRACL) have very small sample sizes, making test scores less reliable. We report averaged F1 scores (test + validation) for more robust evaluation.

Language Performance Summary (4B Model)

| Language | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Performance Tier |
|---|---|---|---|---|---|---|
| High Performance Languages (F1 > 0.85) | | | | | | |
| Russian (ru) | 0.901 | 0.910 | 0.891 | 82 | 82 | 🏆 Excellent |
| Arabic (ar) | 0.899 | 0.916 | 0.882 | 61 | 61 | 🏆 Excellent |
| Finnish (fi) | 0.877 | 0.874 | 0.880 | 48 | 48 | 🏆 Excellent |
| Indonesian (id) | 0.871 | 0.858 | 0.883 | 64 | 64 | 🏆 Excellent |
| Japanese (ja)* | 0.865 | 0.860 | 0.870 | 478 | 478 | 🏆 Excellent |
| Good Performance Languages (F1 0.75-0.85) | | | | | | |
| English (en)* | 0.847 | 0.874 | 0.819 | 682 | 682 | ✅ Very Good |
| Korean (ko) | 0.849 | 0.910 | 0.787 | 30 | 30 | ✅ Very Good† |
| German (de) | 0.843 | 0.776 | 0.909 | 14 | 14 | ✅ Very Good† |
| Persian (fa) | 0.818 | 0.739 | 0.896 | 27 | 27 | ✅ Very Good† |
| Swahili (sw) | 0.806 | 0.767 | 0.845 | 37 | 37 | ✅ Very Good |
| Chinese (zh) | 0.805 | 0.777 | 0.833 | 38 | 38 | ✅ Good |
| Moderate Performance Languages (F1 0.60-0.75) | | | | | | |
| Spanish (es) | 0.715 | 0.800 | 0.629 | 31 | 31 | ⚡ Moderate‡ |
| Yoruba (yo) | 0.662 | 0.630 | 0.694 | 18 | 18 | ⚡ Moderate† |
| French (fr) | 0.622 | 0.774 | 0.470 | 29 | 29 | ⚡ Moderate‡ |
| Limited Performance Languages (F1 < 0.60) | | | | | | |
| Thai (th) | 0.378 | 0.239 | 0.516 | 56 | 56 | ⚠️ Poor |
| Hindi (hi) | 0.348 | 0.538 | 0.158 | 36 | 36 | ⚠️ Poor‡ |
| Bengali (bn) | 0.217 | 0.222 | 0.212 | 23 | 23 | ⚠️ Very Poor |
| Telugu (te) | 0.109 | 0.067 | 0.150 | 61 | 61 | ⚠️ Very Poor |

* Average across multiple datasets (Japanese: 8 datasets, English: MS-MARCO + MIRACL)
† Very small test set - interpret with caution
‡ High variance between test/validation - less stable performance

Note: Performance varies by domain. Languages with more training data generally perform better.

📊 Model Evaluation

We evaluated four model sizes on 28 datasets covering multiple languages and domains. Results shown are combined test+validation scores for robustness.

Model Size Comparison

| Model | Avg F1 | Avg Exact Match | Relative Speed | Recommendation |
|---|---|---|---|---|
| 4B | 0.820 | 0.712 | 1.0x | Recommended |
| 1.7B | 0.794 | 0.666 | 1.5x | High-speed option |
| 8B | 0.827 | 0.727 | 0.6x | Maximum accuracy (not released) |
| 0.6B | 0.707 | 0.600 | 2.5x | Baseline (not released) |

Performance Highlights

Why the 4B Model is Recommended:

  • Only a 0.007 F1 gap with the 8B model (0.820 vs 0.827, under 1% relative)
  • 67% faster inference than 8B model
  • Perfect balance of accuracy and efficiency
  • Optimal for production deployments

Detailed Performance by Dataset (4B Model)

| Dataset | Test F1 | Val F1 | Combined F1 | Test/Val Samples | Notes |
|---|---|---|---|---|---|
| Japanese Datasets | | | | | |
| jaquad | 0.994 | 0.952 | 0.973 | 47/47 | Excellent |
| mr-tydi-ja | 0.939 | 0.791 | 0.865 | 55/55 | Very good |
| jsquad | 0.942 | 0.787 | 0.865 | 53/53 | Very good |
| jqara | 0.906 | 0.888 | 0.897 | 54/54 | Excellent |
| mkqa-ja | 0.679 | 0.823 | 0.751 | 55/55 | Good |
| quiz-works | 0.933 | 0.869 | 0.901 | 51/51 | Excellent |
| quiz-no-mori | 0.607 | 0.879 | 0.743 | 52/52 | Good |
| ms-marco-ja | 0.846 | 0.821 | 0.834 | 168/168 | Very good |
| English Datasets | | | | | |
| ms-marco-en | 0.888 | 0.869 | 0.879 | 630/630 | Very good |
| miracl-en | 0.860 | 0.769 | 0.815 | 52/52 | Very good† |
| MIRACL Languages | | | | | |
| miracl-ar | 0.916 | 0.882 | 0.899 | 61/61 | Excellent† |
| miracl-ru | 0.910 | 0.891 | 0.901 | 82/82 | Excellent† |
| miracl-fi | 0.874 | 0.880 | 0.877 | 48/48 | Very good† |
| miracl-id | 0.858 | 0.883 | 0.871 | 64/64 | Very good† |
| miracl-ko | 0.910 | 0.787 | 0.849 | 30/30 | Very good† |
| miracl-fa | 0.739 | 0.896 | 0.818 | 27/27 | Very good† |
| miracl-de | 0.776 | 0.909 | 0.843 | 14/14 | Very good† |
| miracl-sw | 0.767 | 0.845 | 0.806 | 37/37 | Very good† |
| miracl-zh | 0.777 | 0.833 | 0.805 | 38/38 | Very good† |
| miracl-es | 0.800 | 0.629 | 0.715 | 31/31 | Good† |
| miracl-yo | 0.630 | 0.694 | 0.662 | 18/18 | Moderate† |
| miracl-fr | 0.774 | 0.470 | 0.622 | 29/29 | Moderate† |
| Low Performance | | | | | |
| miracl-hi | 0.538 | 0.158 | 0.348 | 36/36 | Poor† |
| miracl-bn | 0.222 | 0.212 | 0.217 | 23/23 | Poor† |
| miracl-th | 0.239 | 0.516 | 0.378 | 56/56 | Poor† |
| miracl-te | 0.067 | 0.150 | 0.109 | 61/61 | Poor† |
| Special Cases | | | | | |
| auto-wiki-qa-nemotron | 0.571 | 0.912* | 0.742 | 41/41 | *Large split difference |

Note: We show both test and validation scores because some datasets have small test sets, making validation scores more reliable. Large differences between splits (>0.1) may indicate data quality issues.

† Important: All MIRACL datasets have fewer than 100 test samples, with many having fewer than 50. During dataset creation, we should have allocated more samples to the test sets for MIRACL languages. The small sample sizes raise concerns about the statistical reliability of the reported metrics. Performance on these datasets should be interpreted with caution, as the confidence intervals are likely to be wide.

🎯 Model Selection Guide

Choose 4B (Recommended) if you need:

  • Optimal balance of accuracy and speed
  • Production deployment
  • F1 > 0.80 on major languages
  • Cost-effective inference
  • Best performance per computational cost

Choose 1.7B if you need:

  • Maximum speed for real-time applications
  • Resource-constrained environments
  • Use cases where F1 > 0.75 is sufficient
  • Edge deployment scenarios

📚 Training Details

  • Dataset: qa-context-relevance-multilingual-140k
    • 142,629 samples across 20 languages
    • English: MS-MARCO dataset for robust English QA performance
    • Japanese: Various Japanese QA datasets (JaQuAD, JSQuAD, JQARA, Quiz datasets) for comprehensive coverage
    • Multilingual: MIRACL dataset covering 18 languages for broad multilingual support
    • Important Note: Relevance labels were generated using DeepSeek-V3-0324, an LLM, without human verification. Therefore, the ground truth data itself may contain errors and should not be considered perfect annotations.
  • Base Model: Qwen-3 series
  • Method: Supervised Fine-Tuning (SFT)
  • Training Strategy: Multi-language curriculum learning
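
The training scripts themselves are not reproduced in this card. Given the documented prompt and output formats, one plausible way to assemble an SFT chat sample is sketched below; the field names query, contexts, and relevant_indices are assumptions for illustration, and create_user_prompt is the helper from the Quick Start.

def build_sft_messages(query: str, contexts: list[str], relevant_indices: list[int]) -> list[dict]:
    """Format one training example as chat messages: the prompt as the user turn,
    the citation string (or "No answer") as the assistant target."""
    user = create_user_prompt(query, contexts)
    # relevant_indices are assumed 1-based, matching the model's output convention.
    assistant = ",".join(str(i) for i in relevant_indices) if relevant_indices else "No answer"
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]

A list of such message pairs can then be fed to a standard conversational SFT pipeline with the Qwen3 chat template applied.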

🔧 Technical Implementation

The model implements context pruning as described in:

Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation
Chirkova et al., 2025 (arXiv:2501.16214)

Key features:

  • Citation-style output format (e.g., "1,3" for chunks 1 and 3)
  • Handles multiple relevant chunks
  • Language-agnostic prompt format
  • Optimized for chunk-level relevance

📄 License

Apache-2.0

👤 Author

Yuichi Tateno (@hotchpotch)


📖 Appendix

A. Dataset Creation Process

The training dataset was created through:

  1. Collection of diverse QA datasets:
    • MS-MARCO for English
    • Japanese QA datasets (JaQuAD, JSQuAD, JQARA, MKQA-ja, Mr.TyDi-ja, Quiz-no-mori, Quiz-works)
    • MIRACL for 18 additional languages
  2. Text chunking using language-specific tools (a rough illustrative sketch follows below)
  3. Relevance annotation using DeepSeek-V3-0324 (LLM-based annotation)
  4. Quality filtering and balancing

Note: All relevance annotations were generated by an LLM without human verification, which may introduce annotation errors.
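
The chunking tools used in step 2 are not named above; purely as an illustration of sentence-level chunking, a deliberately naive splitter might look like the following. A production pipeline would rely on language-specific sentence segmenters rather than a single regex.

import re


def naive_sentence_chunks(text: str) -> list[str]:
    """Very rough sentence splitter for illustration only.

    Splits after ., !, ?, or the CJK full stop; real multilingual pipelines
    use language-specific segmenters (e.g., for Thai, which does not mark
    sentence ends with this kind of punctuation).
    """
    parts = re.split(r"(?<=[.!?。])\s*", text.strip())
    return [p.strip() for p in parts if p.strip()]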

B. Model Architecture

  • Base: Qwen-3 transformer architecture
  • Context Length: Standard context window
  • Special Tokens: Standard Qwen-3 chat template
  • Output Format: Citation numbers or "No answer"

C. Evaluation Methodology

We evaluated on both test and validation sets because:

  • Some datasets have small test sets (<100 samples)
  • Validation scores provide more stable estimates
  • Combined scores reduce evaluation noise
  • Split differences help identify data quality issues
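
The exact scoring code is not included in this card. Under the natural reading that F1 and exact match are computed per example over the sets of cited chunk indices and then averaged, the metrics would look roughly like the sketch below (an assumption, not the official evaluation script).

def chunk_f1(predicted: set[int], gold: set[int]) -> float:
    """Set-level F1 between predicted and gold chunk indices for one example."""
    if not predicted and not gold:
        return 1.0  # both sides agree that no chunk is relevant
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: set[int], gold: set[int]) -> float:
    """1.0 only when the predicted citation set matches the gold set exactly."""
    return float(predicted == gold)

The averaged F1 reported in the summary tables is then simply the mean of the test and validation means, e.g., Russian (ru): (0.910 + 0.891) / 2 ≈ 0.901.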

D. Performance Comparison Across Model Sizes

D.1 Japanese Datasets Performance

| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-ja | 0.6B | 0.645 | 0.641 | 0.650 | 168 | 168 | 0.548 | 0.524 |
| ms-marco-ja | 1.7B | 0.819 | 0.834 | 0.803 | 168 | 168 | 0.702 | 0.655 |
| ms-marco-ja | 4B | 0.835 | 0.852 | 0.819 | 168 | 168 | 0.750 | 0.696 |
| ms-marco-ja | 8B | 0.859 | 0.890 | 0.828 | 168 | 168 | 0.780 | 0.714 |
| jaquad | 0.6B | 0.926 | 0.948 | 0.904 | 56 | 56 | 0.911 | 0.839 |
| jaquad | 1.7B | 0.972 | 0.978 | 0.966 | 56 | 56 | 0.929 | 0.911 |
| jaquad | 4B | 0.970 | 0.994 | 0.946 | 56 | 56 | 0.982 | 0.893 |
| jaquad | 8B | 0.988 | 1.000 | 0.975 | 56 | 56 | 1.000 | 0.946 |
| mr-tydi-ja | 0.6B | 0.830 | 0.850 | 0.810 | 55 | 55 | 0.818 | 0.691 |
| mr-tydi-ja | 1.7B | 0.867 | 0.892 | 0.842 | 55 | 55 | 0.818 | 0.691 |
| mr-tydi-ja | 4B | 0.915 | 0.921 | 0.908 | 55 | 55 | 0.873 | 0.782 |
| mr-tydi-ja | 8B | 0.880 | 0.870 | 0.891 | 55 | 55 | 0.818 | 0.764 |
| jsquad | 0.6B | 0.919 | 0.903 | 0.934 | 57 | 57 | 0.842 | 0.877 |
| jsquad | 1.7B | 0.945 | 0.945 | 0.946 | 57 | 57 | 0.895 | 0.877 |
| jsquad | 4B | 0.950 | 0.939 | 0.960 | 57 | 57 | 0.877 | 0.912 |
| jsquad | 8B | 0.951 | 0.965 | 0.937 | 57 | 57 | 0.930 | 0.877 |
| quiz-works | 0.6B | 0.870 | 0.885 | 0.855 | 55 | 55 | 0.745 | 0.655 |
| quiz-works | 1.7B | 0.905 | 0.936 | 0.873 | 55 | 55 | 0.800 | 0.691 |
| quiz-works | 4B | 0.898 | 0.933 | 0.862 | 55 | 55 | 0.836 | 0.655 |
| quiz-works | 8B | 0.927 | 0.940 | 0.915 | 55 | 55 | 0.855 | 0.782 |

D.2 English Datasets Performance

| Dataset | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| ms-marco-en | 0.6B | 0.765 | 0.771 | 0.759 | 630 | 630 | 0.657 | 0.646 |
| ms-marco-en | 1.7B | 0.853 | 0.859 | 0.847 | 630 | 630 | 0.737 | 0.722 |
| ms-marco-en | 4B | 0.883 | 0.889 | 0.877 | 630 | 630 | 0.784 | 0.775 |
| ms-marco-en | 8B | 0.901 | 0.904 | 0.897 | 630 | 630 | 0.798 | 0.798 |
| miracl-en | 0.6B | 0.714 | 0.715 | 0.713 | 50 | 50 | 0.580 | 0.640 |
| miracl-en | 1.7B | 0.791 | 0.785 | 0.797 | 50 | 50 | 0.600 | 0.660 |
| miracl-en | 4B | 0.830 | 0.871 | 0.789 | 50 | 50 | 0.760 | 0.660 |
| miracl-en | 8B | 0.832 | 0.834 | 0.830 | 50 | 50 | 0.720 | 0.700 |

D.3 MIRACL Languages Performance (Selected High/Low Performers)

| Language | Model | Avg F1 | Test F1 | Val F1 | Test Samples | Val Samples | Test EM | Val EM |
|---|---|---|---|---|---|---|---|---|
| Arabic (ar) - High Performer | 0.6B | 0.791 | 0.772 | 0.811 | 61 | 61 | 0.656 | 0.656 |
| Arabic (ar) - High Performer | 1.7B | 0.876 | 0.880 | 0.872 | 61 | 61 | 0.738 | 0.721 |
| Arabic (ar) - High Performer | 4B | 0.897 | 0.917 | 0.877 | 61 | 61 | 0.770 | 0.754 |
| Arabic (ar) - High Performer | 8B | 0.903 | 0.904 | 0.902 | 61 | 61 | 0.803 | 0.803 |
| Russian (ru) - High Performer | 0.6B | 0.767 | 0.786 | 0.749 | 82 | 82 | 0.634 | 0.622 |
| Russian (ru) - High Performer | 1.7B | 0.869 | 0.884 | 0.854 | 82 | 82 | 0.780 | 0.732 |
| Russian (ru) - High Performer | 4B | 0.895 | 0.916 | 0.874 | 82 | 82 | 0.841 | 0.768 |
| Russian (ru) - High Performer | 8B | 0.879 | 0.889 | 0.869 | 82 | 82 | 0.817 | 0.768 |
| Finnish (fi) - High Performer | 0.6B | 0.666 | 0.676 | 0.655 | 50 | 50 | 0.600 | 0.480 |
| Finnish (fi) - High Performer | 1.7B | 0.822 | 0.840 | 0.805 | 50 | 50 | 0.600 | 0.620 |
| Finnish (fi) - High Performer | 4B | 0.874 | 0.868 | 0.881 | 50 | 50 | 0.740 | 0.680 |
| Finnish (fi) - High Performer | 8B | 0.892 | 0.882 | 0.901 | 50 | 50 | 0.800 | 0.740 |
| Hindi (hi) - Low Performer | 0.6B | 0.458 | 0.447 | 0.469 | 29 | 29 | 0.379 | 0.379 |
| Hindi (hi) - Low Performer | 1.7B | 0.455 | 0.476 | 0.435 | 29 | 29 | 0.448 | 0.379 |
| Hindi (hi) - Low Performer | 4B | 0.495 | 0.538 | 0.451 | 29 | 29 | 0.483 | 0.379 |
| Hindi (hi) - Low Performer | 8B | 0.431 | 0.476 | 0.387 | 29 | 29 | 0.448 | 0.345 |
| Telugu (te) - Lowest Performer | 0.6B | 0.206 | 0.261 | 0.150 | 60 | 60 | 0.250 | 0.150 |
| Telugu (te) - Lowest Performer | 1.7B | 0.218 | 0.286 | 0.150 | 60 | 60 | 0.267 | 0.150 |
| Telugu (te) - Lowest Performer | 4B | 0.228 | 0.306 | 0.150 | 60 | 60 | 0.283 | 0.150 |
| Telugu (te) - Lowest Performer | 8B | 0.233 | 0.317 | 0.150 | 60 | 60 | 0.317 | 0.150 |

Key Observations:

  1. Diminishing Returns: The performance gain from 4B to 8B is minimal (avg 1-3% improvement) while computational cost increases significantly
  2. Language-Specific Patterns: Some languages (e.g., Telugu, Hindi) show poor performance across all model sizes, indicating data quality issues
  3. 4B Sweet Spot: The 4B model achieves 95-98% of 8B performance at 60% of the computational cost
  4. Small Sample Impact: Languages with <50 test samples show high variance between test/validation scores