For full details, check out the DR Tulu paper.

DR Tulu 8B - MLX 4-bit Quantized

This is DR Tulu 8B converted to MLX 4-bit quantized format for efficient inference on Apple Silicon hardware. This variant is optimized for speed with minimal memory requirements while preserving DR-Tulu's signature reasoning capabilities.

MLX Model Variants - Complete Collection

Choose the best variant for your hardware and performance needs:

| Model | Precision | Model Size | Bits/Weight | Memory Usage | Performance | Repository |
|-------|-----------|------------|-------------|--------------|-------------|------------|
| DR-Tulu-8B-MLX-4bit | 4-bit quantized | 4.3GB | 4.500 | 4.9GB | 78.2 tok/s | Plurigrid/DR-Tulu-8B-MLX-4bit |
| DR-Tulu-8B-MLX-6bit | 6-bit quantized | 6.2GB | 6.500 | 6.9GB | 60.7 tok/s | Plurigrid/DR-Tulu-8B-MLX-6bit |
| DR-Tulu-8B-MLX-8bit | 8-bit quantized | 8.1GB | 8.500 | 8.8GB | 59.8 tok/s | Plurigrid/DR-Tulu-8B-MLX-8bit |
| DR-Tulu-8B-MLX-bf16 | bfloat16 (full) | 15.3GB | ~16.000 | 16.4GB | 35.0 tok/s | Plurigrid/DR-Tulu-8B-MLX-bf16 |

Why Choose 4-bit?

  • Optimized Performance: 78.2 tokens/sec (2.2x faster than bf16)
  • Minimal Memory: 4.9GB RAM usage (3.4x less than bf16)
  • Device Compatibility: Runs on 8GB+ Apple Silicon devices
  • Preserved Reasoning: Full DR-Tulu <think> capabilities intact

Quick Start

Command Line Interface

```bash
# Interactive chat (recommended)
uvx --from mlx-lm mlx_lm.chat --model Plurigrid/DR-Tulu-8B-MLX-4bit

# Generate text
uvx --from mlx-lm mlx_lm.generate --model Plurigrid/DR-Tulu-8B-MLX-4bit --prompt "What is category theory?" --max-tokens 500
```

Python API

```python
from mlx_lm import load, generate

# Load the 4-bit quantized model
model, tokenizer = load("Plurigrid/DR-Tulu-8B-MLX-4bit")

prompt = "Explain quantum computing step by step."

# Apply chat template if available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Generate response with DR-Tulu reasoning
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```

Installation

```bash
# Install MLX-LM
pip install mlx-lm
# or with uv
uv add mlx-lm
```

Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| Platform | Apple Silicon (M1/M2/M3/M4/M5) | M1 Pro/Max or newer |
| Memory | 8GB unified memory | 16GB+ unified memory |
| Storage | 5GB free space | 10GB+ free space |
| OS | macOS 12+ | macOS 14+ (Sonoma) |

Tested Configuration: Mac Studio M1 Ultra (20-core CPU, 128GB unified memory), macOS Sequoia 15.2

Technical Specifications

4-bit Quantization Details:

  • Quantization Method: MLX native affine quantization
  • Effective Bits: 4.500 bits per weight
  • Group Size: 128 (default)
  • Conversion Command: mlx_lm.convert --quantize --q-bits 4
  • Quality Preservation: Excellent (maintains reasoning patterns)
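
As a back-of-envelope check, the effective bits-per-weight figure roughly predicts the on-disk size. This is a sketch; the ~8.2B parameter count assumed for the Qwen3-8B base is an approximation, not a figure from this card:

```python
# Rough on-disk size estimate from effective bits per weight.
# ~8.2e9 parameters is an assumed figure for the Qwen3-8B base model.
params = 8.2e9
bits_per_weight = 4.5  # 4-bit weights plus per-group scale/bias overhead

size_bytes = params * bits_per_weight / 8
size_gib = size_bytes / 2**30

print(f"~{size_gib:.1f} GiB")  # ~4.3 GiB, close to the reported model size
```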

Performance Metrics:

  • Inference Speed: 78.2 tokens/second
  • Memory Efficiency: 4.9GB peak usage
  • Model Loading: ~3-5 seconds
  • Quality: Preserves DR-Tulu's signature <think> reasoning

About DR Tulu

The underlying model is the RL checkpoint of DR Tulu, an open deep research agent trained on top of rl-research/DR-Tulu-SFT-8B.

Key Capabilities:

  • Step-by-step reasoning with visible <think> tags
  • Research-grade analysis and problem-solving
  • Tool-use and multi-turn conversations
  • Mathematical and scientific reasoning
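
Since the model emits its reasoning inside <think> tags, downstream code often wants to separate the reasoning trace from the final answer. A minimal sketch; the helper name and the sample string are illustrative, not part of any DR Tulu API:

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, answer) around <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

# Illustrative output shape, not a real model transcript:
sample = "<think>2 + 2 is basic arithmetic.</think>The answer is 4."
reasoning, answer = split_think(sample)
print(answer)  # The answer is 4.
```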

For more details on DR Tulu, please read our paper!

Evaluation Results

Results from the original DR-Tulu-8B model (quality preserved in 4-bit variant):

| Benchmark | SQAv2 | HealthBench | ResearchQA | DeepResearch Bench | SimpleQA | 2Wiki | WebWalker | Average |
|-----------|-------|-------------|------------|--------------------|----------|-------|-----------|---------|
| DR-Tulu-8B | 86.7 | 43.7 | 71.1 | 41.8 | 80.1 | 68.0 | 39.1 | 61.5 |
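
The Average column is the unweighted mean over the seven benchmarks; a quick check with the reported scores:

```python
# Benchmark scores for DR-Tulu-8B from the table above.
scores = {
    "SQAv2": 86.7, "HealthBench": 43.7, "ResearchQA": 71.1,
    "DeepResearch Bench": 41.8, "SimpleQA": 80.1,
    "2Wiki": 68.0, "WebWalker": 39.1,
}
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 61.5
```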

Advanced Usage

Multi-turn Conversation

```python
messages = [
    {"role": "user", "content": "What is category theory?"},
    {"role": "assistant", "content": "Category theory is a mathematical framework..."},
    {"role": "user", "content": "How does it apply to computer science?"}
]

if tokenizer.chat_template is not None:
    formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=formatted_prompt, max_tokens=1000)
    print(response)
```
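
For longer sessions it helps to keep the history in one place and re-apply the chat template on every turn. A minimal sketch of that loop; `ask` is a hypothetical helper, and the generation call is injected so the pattern runs independent of the model:

```python
def ask(history, user_message, generate_fn):
    """Append a user turn, generate a reply, and record it in the history."""
    history.append({"role": "user", "content": user_message})
    reply = generate_fn(history)  # e.g. apply chat template + mlx_lm generate
    history.append({"role": "assistant", "content": reply})
    return reply

# Stubbed generator so the pattern runs without loading the model:
echo = lambda history: f"({len(history)} messages so far)"
history = []
ask(history, "What is category theory?", echo)
ask(history, "How does it apply to computer science?", echo)
print(len(history))  # 4: two user turns and two assistant replies
```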

Research-Style Analysis

research_prompt = """
Analyze the relationship between quantum mechanics and information theory.
Think step by step and provide a comprehensive analysis.
"""

response = generate(model, tokenizer, prompt=research_prompt, max_tokens=1500, verbose=True)

License & Usage

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

4-bit Specific Considerations:

  • Optimized for Apple Silicon hardware only
  • Excellent quality preservation with 4.500 bits per weight
  • Fastest inference in the MLX model series
  • Ideal for real-time applications and resource-constrained environments

Citation

```bibtex
@article{drtulu,
  title  = {{DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research}},
  author = {Rulin Shao and Akari Asai and Shannon Shen and Hamish Ivison and Varsha Kishore and Jingming Zhuo and Xinran Zhao and Molly Park and Sam Finlayson and David Sontag and Tyler Murray and Sewon Min and Pradeep Dasigi and Luca Soldaini and Faeze Brahman and Scott Yih and Sherry Tongshuang Wu and Luke Zettlemoyer and Yoon Kim and Hanna Hajishirzi and Pang Wei Koh},
  year   = {2025},
}
```

Conversion Details

  • Conversion Date: November 22, 2024
  • Converter: MLX community via Plurigrid
  • Command: `uvx --from mlx-lm mlx_lm.convert --hf-path rl-research/DR-Tulu-8B --mlx-path ./DR-Tulu-8B-4bit --quantize --q-bits 4`
  • Framework Version: mlx-lm latest (November 2024)
  • Validation: Tested with 1069-token generation maintaining quality