# BR-RM: Branch-and-Rethink Reasoning Reward Model
## Model Overview

BR-RM (Branch-and-Rethink Reasoning Reward Model) implements a novel two-turn reasoning framework for evaluating LLM-generated responses. Unlike traditional reward models that compress all quality dimensions into a single scalar in one shot, BR-RM first performs adaptive branching to focus on the instance-critical dimensions, then applies branch-conditioned rethinking for targeted deep analysis.

This model achieves state-of-the-art average performance across three major reward modeling benchmarks (RewardBench, RM-Bench, and RMB) by addressing the "judgment diffusion" problem, in which models spread attention too thinly across evaluation criteria.
## Key Features

- 🎯 Adaptive Focus: Dynamically selects 1-3 critical evaluation dimensions per instance
- 🔄 Two-Turn Reasoning: The first turn branches; the second turn performs deep, branch-conditioned analysis
- 📊 SOTA Performance: Top results on RewardBench (92.1%), RM-Bench (85.9%), and RMB (74.7%)
- 🔧 RLHF Compatible: Designed to integrate seamlessly with standard RLHF pipelines (see the sketch after this list)
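As a rough illustration of the RLHF-compatibility point above, the sketch below shows how BR-RM's pairwise judgments could label `(chosen, rejected)` pairs for a standard preference-optimization pipeline. It is not part of the released code; `brrm_prefers` is a hypothetical placeholder for the two-turn evaluation shown in the Quick Start section.

```python
# Hypothetical sketch: turning BR-RM pairwise judgments into preference data.
# `brrm_prefers` is a placeholder callable (prompt, resp1, resp2) -> 1 or 2,
# e.g. the two-turn Quick Start flow below wrapped in a function.
from typing import Callable, List, Tuple


def build_preference_pairs(
    prompts: List[str],
    candidates: List[Tuple[str, str]],
    brrm_prefers: Callable[[str, str, str], int],
) -> List[dict]:
    """Label each candidate pair with BR-RM's preferred/rejected response."""
    pairs = []
    for prompt, (resp1, resp2) in zip(prompts, candidates):
        winner = brrm_prefers(prompt, resp1, resp2)
        chosen, rejected = (resp1, resp2) if winner == 1 else (resp2, resp1)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```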
 
## Model Variants
| Model | Parameters | RewardBench | RM-Bench | RMB | Average | 
|---|---|---|---|---|---|
| Qwen3-Nemotron-8B-BRRM | 8B | 91.0 | 85.0 | 71.8 | 82.6 | 
| Qwen3-Nemotron-14B-BRRM | 14B | 92.1 | 85.9 | 74.7 | 84.2 | 
## How It Works

### Two-Turn Framework

**Turn 1: Adaptive Branching**
- Input: user query + two candidate responses
- Output:
  1. Selected critical dimensions (e.g., "Logical Reasoning", "Computational Precision")
  2. Initial issue detection for each response

**Turn 2: Branch-Conditioned Rethinking**
- Input: Turn 1 results + evaluation hierarchy
- Output: final comparative judgment and preference ranking
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model_name = "nvidia/Qwen3-Nemotron-14B-BRRM"  # or nvidia/Qwen3-Nemotron-8B-BRRM
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Example usage
context = "What is 2+2?"
response1 = "2+2=4"
response2 = "2+2=5"
# Format Turn 1: Adaptive Branching
turn1_prompt = f"""You are a response quality evaluator. Given the context and two responses, select the most important cognitive abilities and analyze critical issues.
**Context:** 
{context}
**Responses:**
[The Begin of Response 1]
{response1}
[The End of Response 1]
[The Begin of Response 2]
{response2}
[The End of Response 2]
**Output Format:**
[Quality Assessment Focus]
Choose 1-3 abilities: Information Accuracy, Computational Precision, Logical Reasoning, Implementation Capability, Safety Awareness, Response Completeness, Instruction Adherence, Communication Clarity.
[End of Quality Assessment Focus]
[Quality Analysis for Response 1]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 1]
[Quality Analysis for Response 2]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 2]"""
# Generate Turn 1
messages = [{"role": "user", "content": turn1_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, 
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    input_ids, 
    max_new_tokens=8192,      
    temperature=1.0,
    top_p=0.95,               
    top_k=20,                 
    do_sample=True,           
    pad_token_id=tokenizer.eos_token_id
)
turn1_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)
# Format Turn 2: Branch-Conditioned Rethinking
turn2_prompt = f"""You are making final comparative judgments using established evaluation priorities.
**Evaluation Hierarchies:**
- **Accuracy-Critical**: Correctness > Process > Presentation 
- **Creative/Open-Ended**: User Intent > Content Quality > Creativity 
- **Instruction-Following**: Adherence > Content > Clarity
[The Begin of Analysis on Response 1]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 1]
[The Begin of Analysis on Response 2]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 2]
[The Begin of Ranking Score]
\\boxed{{1 or 2}}
[The End of Ranking Score]"""
# Generate Turn 2
messages.append({"role": "assistant", "content": turn1_response})
messages.append({"role": "user", "content": turn2_prompt})
input_ids = tokenizer.apply_chat_template(
    messages, 
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    input_ids, 
    max_new_tokens=8192,      
    temperature=1.0,
    top_p=0.95,               
    top_k=20,                 
    do_sample=True,           
    pad_token_id=tokenizer.eos_token_id
)
final_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)
```
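The Turn 2 output places the final ranking inside `\boxed{...}`. A minimal sketch for extracting the preferred response index from `final_response` (the helper below is illustrative, not part of the released code):

```python
import re

def parse_preference(final_response: str):
    """Extract the preferred response index (1 or 2) from the Turn 2 output.

    Looks for the \\boxed{...} ranking score; returns None if it is absent.
    """
    match = re.search(r"\\boxed\{\s*([12])\s*\}", final_response)
    return int(match.group(1)) if match else None

preferred = parse_preference(final_response)
print(f"Preferred response: {preferred}")  # Expected to be 1 here, since 2+2=4 is correct
```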
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety and Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.
## Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{jiao2025thinktwicebranchandrethinkreasoning,
      title={Think Twice: Branch-and-Rethink Reasoning Reward Model},
      author={Yizhu Jiao and Jiaqi Zeng and Julien Veron Vialard and Oleksii Kuchaiev and Jiawei Han and Olivier Delalleau},
      year={2025},
      eprint={2510.23596},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.23596}
}
```
## Model Tree

Base model for nvidia/Qwen3-Nemotron-14B-BRRM: Qwen/Qwen3-14B-Base