Difficulty Scorer v2
A Qwen3-8B-based difficulty scorer trained on our own difficulty data, as used in our EMNLP 2025 submission
Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy [REF]
The model scores the difficulty of instructions. More challenging instructions are associated with better learning outcomes during training.
Model Architecture
- Finetuned model based on Qwen/Qwen3-8B
- Custom head: regression head on top of a pooling layer

For more details, see `model.py`.
TODO: erase doubled weights from regression_head.bin
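As a rough illustration of the architecture described above (a sketch only, not the actual `model.py` implementation), the head can be thought of as masked mean-pooling over token hidden states followed by a linear layer that emits one difficulty score per sequence:

```python
import numpy as np

def score_difficulty(hidden_states, attention_mask, w, b):
    """Illustrative regression head: masked mean-pooling + linear layer.

    hidden_states: (batch, seq_len, hidden_dim) token representations
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    w: (hidden_dim,) weight vector, b: scalar bias
    Returns one difficulty score per sequence.
    """
    mask = attention_mask[:, :, None]
    # mean-pool only over non-padding positions
    pooled = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)
    return pooled @ w + b
```

The exact pooling strategy and head parameterization used by the released model are defined in `model.py`; this snippet only conveys the general shape of the computation.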
How to Use
```python
from transformers import AutoModelForCausalLM

# Get model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "IIS-NLP-internal/qwen3-8B-difficulty-scorer-v2", trust_remote_code=True
)
tokenizer = model.get_tokenizer()

# Prepare input data (raw strings keep LaTeX commands like \frac intact)
current_category = "Math"
system_template = "You are an expert of {category} data. You judge problems for their difficulty."
instructions = [
    "What is the sum of 1 and 2?",
    r"What are all values of $p$ such that for every $q>0$, "
    r"we have $$\frac{3(pq^2+p^2q+3q^2+3pq)}{p+q}>2p^2q?$$ "
    r"Express your answer in interval notation in decimal form.",
]
convs = [
    [
        {"role": "system", "content": system_template.format(category=current_category)},
        {"role": "user", "content": instruction},
    ]
    for instruction in instructions
]

# Tokenize and score each conversation; the regression head returns a single difficulty logit
conv_1_tokenized = tokenizer.apply_chat_template(convs[0], tokenize=True, return_tensors="pt").to(model.model.device)
conv_2_tokenized = tokenizer.apply_chat_template(convs[1], tokenize=True, return_tensors="pt").to(model.model.device)
difficulty_1 = model(conv_1_tokenized)["logits"].item()
difficulty_2 = model(conv_2_tokenized)["logits"].item()
print(difficulty_1, difficulty_2)
# -0.12232150137424469 0.1787720024585724
```
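Once instructions have been scored as above, the scores can be used to rank a pool and keep the most challenging examples for training. `select_hardest` below is a hypothetical helper for this step, not part of this repository:

```python
def select_hardest(instructions, scores, k):
    """Return the k instructions with the highest difficulty scores.

    Higher score = more difficult (as produced by the scorer above).
    """
    ranked = sorted(zip(instructions, scores), key=lambda t: t[1], reverse=True)
    return [inst for inst, _ in ranked[:k]]
```

For stratified selection as in the paper, this ranking would be applied per category rather than over the whole pool.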
Model Files
- `pytorch_model-0000x-of-00002.bin` – finetuned model weights
- `regression_head.bin` – custom regression head
- `config.json` – configuration including base model and head details
- `tokenizer.json`, `vocab.txt`, etc. – tokenizer files
- `model.py` – custom regression model implementation
Evaluation
We mostly validated the scorer through its downstream benefits in training (see paper). As an additional sanity check, we compared our scores against coding data from deepmind/code_contests, which includes difficulty scores:
Correlation of our difficulty scores with the code_contests difficulty scores is r = 0.41
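For reference, the reported r is a Pearson correlation coefficient, which can be computed directly from the two score lists; the helper below is a generic sketch, not the evaluation script we used:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length score sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    # covariance divided by the product of standard deviations
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```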
Responsible
Mostly Lucas W.
