McGill-NLP/delethink-24k-1.5b

TL;DR

  • Markovian Thinking for RL in reasoning LLMs: replace the default MDP, whose state is the prompt plus all past thinking tokens (quadratic compute), with a bounded, fixed-size state, yielding compute linear in thinking tokens and constant memory by design.
  • Delethink RL trains a model to “think” in fixed-size chunks with a bounded state.
  • This 1.5B model uses an effective thinking budget of about 24K tokens while requiring only an 8K active context at any time, via chunked rollouts and short carryovers.
  • Initialized from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, trained with the Delethink RL paradigm. See the paper for full details.

Model Summary

  • Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  • Objective: Reinforcement Learning for long-form reasoning with bounded, chunked thinking (Delethink) trained for 1000 steps.
  • Delethink 24K budget: 8K “context size” chunks, short “Markovian” carryovers, and up to 5 chunk iterations, for ~24K total thinking tokens.
  • Intended use: Math/logic reasoning with step-by-step derivations; final answer typically formatted inside LaTeX \boxed{}.
  • Library compatibility: Works well with SGLang for chunked inference (see the Quickstart below); also usable with Transformers for standard generation, as in the sketch that follows (chunking with Transformers requires manual orchestration; see the paper for an example).
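A minimal Transformers sketch of standard, non-chunked generation (the question and generation settings here are illustrative; this path does not realize Delethink’s efficiency gains):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "McGill-NLP/delethink-24k-1.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative question; the \boxed{} instruction follows this card's Prompting section.
messages = [{"role": "user", "content": "What is 17 * 23? Please reason step by step, and put your final answer within \\boxed{}."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))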

Intended Uses and Limitations

  • Intended uses:
    • Long-form reasoning on math and related tasks.
    • Bounded-context rollouts with repeated chunking and short carryovers.
  • Not intended for:
    • Safety-sensitive applications without human oversight.
    • Use cases requiring faithful, verifiable citations to external sources.
  • Limitations:
    • May hallucinate, make arithmetic/algebraic mistakes, or produce inconsistent plans.
    • The chunked rollout procedure (see the Quickstart below) is required to realize Delethink’s efficiency advantages.

How Delethink Works (Concept)

Let:

  • C = context_size per chunk (active KV memory)
  • m = markovian_size = number of tokens carried over to the next chunk
  • I = iteration_cap = maximum number of chunks

Effective thinking budget is:

  • C + (I − 1) × (C − m)

For this checkpoint, we recommend:

  • C = 8192
  • m = 4096
  • I ≤ 5

This yields an effective budget of ≈ 8192 + 4 × (8192 − 4096) = 24576 tokens of thinking.
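As a quick sanity check, a tiny helper (hypothetical, not from any library) that computes this budget:

def effective_budget(context_size: int, markovian_size: int, iteration_cap: int) -> int:
    # C + (I - 1) * (C - m): the first chunk fills a full context window;
    # each later chunk regenerates only the non-carried-over portion.
    return context_size + (iteration_cap - 1) * (context_size - markovian_size)

assert effective_budget(8192, 4096, 5) == 24576  # this checkpoint's ~24K budget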

Prompting

  • Use the model’s chat template and request a step-by-step solution with a final boxed answer:
    • “Please reason step by step, and put your final answer within \boxed{}.”
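For example, constructing the user message (the question is a placeholder):

# Illustrative prompt construction; the question is a placeholder.
question = "Find the sum of the first 100 positive integers."
messages = [{
    "role": "user",
    "content": question + "\n\nPlease reason step by step, and put your final answer within \\boxed{}.",
}]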

Quickstart (SGLang, chunked Delethink rollout)

import asyncio
import sglang as sgl

async def delethink_tracing(llm, query_ids, context_size=8192, markovian_size=4096, iteration_cap=5):
    """Delethink rollout: generate in bounded chunks, carrying the query plus
    the last `markovian_size` tokens of each chunk into the next one."""
    sampling_params = {"temperature": 0.6}
    trace_response_ids = []
    iterations = 0
    prompt_ids = query_ids

    while iterations < iteration_cap:
        params = dict(sampling_params)
        # The first chunk fills the full context; later chunks regenerate only
        # the non-carried-over portion (context_size - markovian_size).
        params["max_new_tokens"] = (context_size - markovian_size) if iterations > 0 else context_size

        resp = await llm.async_generate(input_ids=prompt_ids, sampling_params=params, return_logprob=True)
        if "output_ids" in resp:
            out_ids = resp["output_ids"]
        else:
            # Fall back to recovering token ids from the per-token logprob
            # tuples of the form (logprob, token_id, token_text).
            _, out_ids = zip(*[(lp, tids) for lp, tids, _ in resp["meta_info"]["output_token_logprobs"]])
            out_ids = list(out_ids)

        trace_response_ids.append(out_ids)

        if iterations == 0:
            # Pin the first 100 generated tokens onto the query so every later
            # chunk still sees how the reasoning began.
            query_ids = query_ids + out_ids[:100]

        # Stop as soon as the model emits EOS (it has finished answering).
        finish_reason_is_eos = resp["meta_info"]["finish_reason"]["type"] == "stop"
        if finish_reason_is_eos:
            break

        # Next chunk's prompt: the query plus the short Markovian carryover.
        prompt_ids = query_ids + out_ids[-markovian_size:]
        iterations += 1

    # Concatenate all chunks into the full thinking trace.
    return sum(trace_response_ids, [])

def main():
    # Launch an SGLang engine serving this Delethink checkpoint.
    llm = sgl.Engine(
        model_path="McGill-NLP/delethink-24k-1.5b",
        dtype="bfloat16",
        attention_backend="flashinfer",
        mem_fraction_static=0.8,
        log_level="WARNING",
    )

    # A competition-style math problem plus the recommended instruction
    # from the Prompting section above.
    prompt = (
        r"There exist real numbers $x$ and $y$, both greater than 1, such that "
        r"$\log_x\left(y^x\right)=\log_y\left(x^{4y}\right)=10$. Find $xy$."
        "\n\nPlease reason step by step, and put your final answer within \\boxed{}."
    )
    tok = llm.tokenizer_manager.tokenizer
    query_ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=True,
        add_generation_prompt=True,
    )

    # Run the chunked rollout and decode the full thinking trace.
    ids = asyncio.run(delethink_tracing(llm, query_ids, context_size=8192, markovian_size=4096, iteration_cap=5))
    print(tok.decode(ids, skip_special_tokens=False))

if __name__ == "__main__":
    main()
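
To pull the final answer out of a decoded trace, a small brace-counting helper (illustrative, not part of the model's tooling) can extract the last \boxed{...}:

def extract_boxed(text):
    # Return the contents of the last \boxed{...}, counting braces so
    # nested expressions like \boxed{\frac{1}{2}} survive intact.
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = begin = start + len(r"\boxed{")
    depth = 1
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[begin:i - 1] if depth == 0 else None

print(extract_boxed(r"Thus the answer is \boxed{25}."))  # -> 25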

Suggested generation settings

  • temperature: 0.6
  • top_p: 1.0
  • top_k: -1
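
Passed to SGLang, these settings form the sampling_params dict used in the Quickstart above:

sampling_params = {"temperature": 0.6, "top_p": 1.0, "top_k": -1}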

Safety and Use

  • This model can produce incorrect or misleading reasoning steps and answers. Always verify results.
  • Do not deploy in high-stakes domains without human oversight.

Citation

@misc{Aghajohari2025:TheMarkovianThinker,
      title={The Markovian Thinker}, 
      author={Milad Aghajohari and Kamran Chitsaz and Amirhossein Kazemnejad and Sarath Chandar and Alessandro Sordoni and Aaron Courville and Siva Reddy},
      year={2025},
      eprint={2510.06557},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.06557}, 
}