Reason-Rewriter

Introduction

We propose Reason-Rewriter, a query-rewriting model designed for reasoning-intensive document retrieval. It rewrites input queries into forms better suited for retrieval, improving downstream retrieval performance.

For more details, please refer to our GitHub repository: ReasonRewriter

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cfli/reasoner-rewriter-qwen2.5-7b-0821"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt_template = """\
Given a task and an input, first analyze the task and the input within the `<think>` and `</think>` tags. In your analysis:
- Break down the requirements of the task
- Identify key components from the input
- Think step by step to reason about what should be included in the output

Then, within the `<response>` and `</response>` tags, present the complete long output.

## Task
{task}

## Input
{query}"""

query = """Claim in article about why insects are attracted to light
In this article they are addressing the reason insects are attracted to light when they say
Heat radiation as an attractive component is refuted by the effect of LED lighting, which supplies negligible infrared radiation yet still entraps vast numbers of insects.
I don't see why attraction to LEDs shows they're not seeking heat. Could they for example be evolutionarily programmed to associate light with heat? So that even though they don't encounter heat near/on the LEDs they still "expect" to?"""

task = 'Generate information that is relevant and helpful to address the questions in detail.'

messages = [
    {"role": "user", "content": prompt_template.format(query=query, task=task)}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample the rewrite; do_sample=True ensures the temperature/top_p settings take effect
generated_ids = model.generate(
    **model_inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=4096
)
# Strip the prompt tokens so only the newly generated text remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

rewrite_query = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
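
The decoded output contains both the <think> analysis and the final rewrite wrapped in <response> tags, as specified by the prompt template. Below is a minimal sketch for extracting just the rewrite; the helper name and the fallback to the full output are our assumptions, not part of the official pipeline:

import re

def extract_response(output: str) -> str:
    # Keep only the text between <response> and </response>;
    # fall back to the full output if the tags are missing.
    match = re.search(r"<response>(.*?)</response>", output, flags=re.DOTALL)
    return match.group(1).strip() if match else output.strip()

rewrite_query = extract_response(rewrite_query)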

Instructions for Each Task

Each dataset uses its own task instruction; pass the matching entry below as the task argument when formatting the prompt template. A usage example follows the list.

task_desc = {
    'biology': 'Generate information that is relevant and helpful to address the questions in detail.',
    'earth_science': 'Generate information that is relevant and helpful to address the questions in detail.',
    'economics': 'Generate information that is relevant and helpful to address the questions in detail.',
    'psychology': 'Generate information that is relevant and helpful to address the questions in detail.',
    'robotics': 'Generate information that is relevant and helpful to address the questions in detail.',
    'stackoverflow': 'Generate information that is relevant and helpful to address the questions in detail.',
    'sustainable_living': 'Generate information that is relevant and helpful to address the questions in detail.',
    'pony': 'Generate the Pony syntax and features (including basic components) required to solve this code problem, and provide a detailed explanation of these syntax and features.',
    'leetcode': 'Generate the data structures (of any kind) and algorithms (e.g., DFS, DP, sorting, traversals, etc.) required to solve the problem, and provide a detailed explanation of the data structures and algorithms.',
    'aops': 'Generate the functions and techniques involved in solving this math problem, and provide a detailed explanation of the functions and techniques.',
    'theoremqa_theorems': 'Generate the theorem necessary to solve this mathematical problem, and provide a detailed explanation of the theorem.',
    'theoremqa_questions': 'Generate the mathematical concepts, formulas, and theorems required to solve this math problem, and provide detailed explanations for each.'
}
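
For example, to rewrite a leetcode query, look up its instruction in task_desc and format the prompt exactly as in the Quickstart (the query variable is assumed to hold a leetcode problem):

task = task_desc['leetcode']
messages = [
    {"role": "user", "content": prompt_template.format(query=query, task=task)}
]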

Citation

If you find our work helpful, please consider citing it.

@inproceedings{li-etal-2025-reinforced,
  title={Reinforced {IR}: A Self-Boosting Framework For Domain-Adapted Information Retrieval},
  author={Li, Chaofan and Chen, Jianlyu and Shao, Yingxia and Li, Chaozhuo and Xu, Quanqing and Lian, Defu and Liu, Zheng},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month=jul,
  year={2025},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2025.acl-long.1071/}
}

@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}

@article{lan2025retro,
  title={Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval},
  author={Lan, Junwei and Chen, Jianlyu and Liu, Zheng and Li, Chaofan and Bao, Siqi and Lian, Defu},
  journal={arXiv preprint arXiv:2509.24869},
  year={2025}
}