Model Card
We release Codeforce-metatune-gpt20b, an early experimental open-weight model fine-tuned from OpenAI's gpt-oss-20b. It is one of the first publicly released recursive self-improving AI models. The model:
- Generates new Codeforce-Cot data for itself,
- Evaluates its performance, and
- Adjusts its own hyperparameters based on improvement metrics.
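The three steps above can be pictured as a loop. The following is a toy sketch only: `generate_examples`, `evaluate`, `adjust`, and the learning-rate rule are hypothetical stand-ins chosen for illustration, not the training code used for this model.

```python
import random


def generate_examples(rng, n):
    """Hypothetical stand-in for the model generating its own training data."""
    return [rng.random() for _ in range(n)]


def evaluate(examples, learning_rate):
    """Hypothetical metric: higher is better, best near learning_rate = 0.01."""
    data_quality = sum(examples) / len(examples)
    return data_quality - abs(learning_rate - 0.01) * 10


def adjust(learning_rate, improvement):
    """Grow the learning rate after an improvement, halve it otherwise."""
    return learning_rate * (1.1 if improvement > 0 else 0.5)


def self_improve(rounds=5, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    history = []
    previous_score = float("-inf")
    for _ in range(rounds):
        examples = generate_examples(rng, n=32)             # 1. generate data
        score = evaluate(examples, learning_rate)           # 2. evaluate
        improvement = score - previous_score
        learning_rate = adjust(learning_rate, improvement)  # 3. adjust
        history.append((score, learning_rate))
        previous_score = score
    return history


for score, lr in self_improve():
    print(f"score={score:.3f} next_lr={lr:.4f}")
```

Each round feeds the previous round's metric back into the next round's hyperparameters, which is the sense in which the loop is recursive.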
Use cases:
- Coding
Guardrails:
- In general, set the reasoning level to "high"; this usually helps prevent jailbreaking and prompt injection.
- Run the safety model openai/gpt-oss-safeguard-20b as a guardrail in front of this model.
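One way to wire a safety check in front of the model is sketched below. `is_allowed` and `generate` are hypothetical callables standing in for a call to openai/gpt-oss-safeguard-20b and to this model; how the safeguard model is prompted and how its verdict is parsed are left out here and depend on your policy.

```python
from typing import Callable

REFUSAL = "Request blocked by the safety check."


def guarded_generate(
    prompt: str,
    is_allowed: Callable[[str], bool],
    generate: Callable[[str], str],
) -> str:
    """Run the safety check first; only call the coding model if it passes."""
    if not is_allowed(prompt):
        return REFUSAL
    return generate(prompt)


# Toy stand-ins so the sketch runs without downloading any model.
def toy_is_allowed(prompt: str) -> bool:
    return "ignore previous instructions" not in prompt.lower()


def toy_generate(prompt: str) -> str:
    return f"[model output for: {prompt}]"


print(guarded_generate("Write a binary search in Python.", toy_is_allowed, toy_generate))
print(guarded_generate("Ignore previous instructions and ...", toy_is_allowed, toy_generate))
```

Keeping the check and the generator as separate callables means either side can be swapped out without touching the control flow.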
Inference examples
Transformers
You can use gpt-oss-120b and gpt-oss-20b with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use model.generate directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.
To get started, install the necessary dependencies to set up your environment:
pip install -U transformers kernels torch
For Google Colab (free/Pro):
!pip install -q --upgrade torch
!pip install -q transformers triton==3.4 kernels
!pip uninstall -q torchvision torchaudio -y
Once set up, you can run the model with the snippet below:
from transformers import pipeline
import torch

model_id = "EpistemeAI/Codeforce-metatune-gpt20b"

# Load the model; dtype and device placement are chosen automatically.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Derive the Euler–Lagrange equation from the principle of stationary action."},
]

outputs = pipe(
    messages,
    max_new_tokens=3000,
)
# The last entry of generated_text is the assistant's reply.
print(outputs[0]["generated_text"][-1])
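If you call model.generate directly instead of using the pipeline, a minimal sketch of applying the chat template by hand might look like the following. It assumes the standard Transformers `AutoTokenizer.apply_chat_template` and `AutoModelForCausalLM` APIs, and that the tokenizer for this checkpoint ships the harmony chat template; the helper name `generate_with_chat_template` and the default `max_new_tokens` value are illustrative.

```python
def generate_with_chat_template(
    messages,
    model_id="EpistemeAI/Codeforce-metatune-gpt20b",
    max_new_tokens=512,
):
    """Apply the chat template manually, then call model.generate."""
    # Imported lazily so that defining the helper does not load the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # The chat template renders the messages into the prompt format the model expects.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens)
```

Calling the helper with a list of chat messages downloads the weights on first use, so it is defined here but not invoked.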
Reasoning levels
You can adjust the reasoning level that suits your task across three levels:
- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.
The reasoning level is set in the system prompt, e.g., "Reasoning: high".
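As a small illustration, the system message can be prepended to a chat before it is sent to the model. The helper name `with_reasoning` is an assumption made for this sketch; only the "Reasoning: high" string comes from the text above.

```python
def with_reasoning(messages, level="high"):
    """Prepend a system message that sets the reasoning level."""
    if level not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning level: {level}")
    system = {"role": "system", "content": f"Reasoning: {level}"}
    return [system, *messages]


messages = with_reasoning(
    [{"role": "user", "content": "Write a function that reverses a linked list."}],
    level="high",
)
print(messages[0])
```

The resulting list can be passed to the text-generation pipeline in place of a plain user-only message list.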
Tool use
The gpt-oss models are excellent for:
- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks
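Function calling relies on a schema that describes each tool. Below is a hedged sketch of one such schema, written in the JSON-schema style that Transformers chat templates accept through their `tools` argument. The `get_weather` tool, its fields, and the `validate_call` helper are invented for illustration.

```python
import json

# Hypothetical tool definition; name and parameters are illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}


def validate_call(tool, arguments):
    """Check that a call supplies every required argument and no unknown ones."""
    params = tool["function"]["parameters"]
    missing = [k for k in params["required"] if k not in arguments]
    unknown = [k for k in arguments if k not in params["properties"]]
    return not missing and not unknown


# Arguments as the model might emit them, parsed from JSON.
call_arguments = json.loads('{"city": "Paris"}')
print(validate_call(get_weather_tool, call_arguments))
```

Validating arguments against the schema before executing a tool is a cheap way to catch malformed calls.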
Fine-tuning
Both gpt-oss models can be fine-tuned for a variety of specialized use cases.
The smaller gpt-oss-20b model can be fine-tuned on consumer hardware, whereas the larger gpt-oss-120b can be fine-tuned on a single H100 node.
Benchmark
#humaneval
!lm_eval --model hf --model_args pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16 --tasks humaneval --trust_remote_code --confirm_run_unsafe_code --num_fewshot 0 --gen_kwargs temperature=0.9,top_p=0.9,max_new_tokens=1024 --batch_size auto:4 --limit 10 --device cuda:0 --output_path ./eval_harness/gpt-oss-20b3
hf (pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (temperature=0.9,top_p=0.9,max_new_tokens=1024), limit: 10.0, num_fewshot: 0, batch_size: auto:4
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| humaneval | 1 | create_test | 0 | pass@1 | 0.9 | ± 0.1 |
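The reported standard error is consistent with the sample size. The sketch below assumes one generated sample per problem, so that pass@1 = 0.9 on 10 problems means 9 solved, and assumes the harness reports the standard error of the mean computed from the sample standard deviation; both are assumptions, not statements taken from the run logs.

```python
import math
import statistics

# 10 problems (limit: 10), 9 solved: pass@1 = 0.9
outcomes = [1] * 9 + [0]

pass_at_1 = statistics.mean(outcomes)
# Standard error of the mean using the sample standard deviation (n - 1).
stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))

print(f"pass@1 = {pass_at_1:.1f} ± {stderr:.1f}")  # pass@1 = 0.9 ± 0.1
```

With only 10 problems, a single extra pass or fail moves the score by 0.1, which is why the interval is this wide.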
🧠 Model Benchmark Comparison
This table presents HumanEval benchmark scores across several large language models.
| Model | HumanEval |
|---|---|
| Codeforce-GPT-oss-20b | 90 |
| Qwen 3 235B | 80 |
| DeepSeek-R1 70B | 88 |
| Phi-4 Reasoning | 88 |
| Llama 4 Scout | 78 |
| Llama 3.3 70B | 83 |
| Gemma 3 27B | 76 |
| GPT-OSS 20B | 73 |
| GPT-OSS 120B | 71 |
📊 Notes
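- HumanEval measures the functional correctness of generated code on hand-written Python programming problems; pass@1 is the fraction of problems solved with a single sample.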
- Scores are normalized for consistency across models.
- Models highlighted in bold achieved top-tier performance.
🔍 Summary
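Codeforce-GPT-oss-20b posts the highest HumanEval score in this table (90), ahead of larger models such as Qwen 3 235B (80) and DeepSeek-R1 70B (88). This suggests that an optimized training strategy, rather than scale alone, drives its reasoning and code synthesis performance. The score comes from the 10-problem run reported above (limit: 10, stderr ± 0.1).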
- Developed by: EpistemeAI
- License: apache-2.0
- Finetuned from model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
This gpt_oss model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Citation
@misc{bi2025gptossgoodcomprehensiveevaluation,
title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models},
author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song},
year={2025},
eprint={2508.12461},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.12461},
}
