---
license: apache-2.0
datasets:
- zkzhang88/OCEData
base_model:
- Qwen/Qwen3-8B-Base
---
# OpenCodeEdit Series Models Quick Start Guide (OpenCodeEdit-Qwen3-8B)
**For details, please refer to our [paper on arXiv](https://arxiv.org/abs/2509.25203).**
We advise you to use the latest version of `transformers`.
Requirements:
```
transformers
torchvision
torchaudio
tensorboard
```
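A typical setup might look like the following (this exact command is an assumption, not from the original card; note that `accelerate` is additionally required by `transformers` when you pass `device_map="auto"` as in the example below):

```shell
pip install -U transformers accelerate torchvision torchaudio tensorboard
```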
## Model Overview
**OpenCodeEdit-Qwen3-8B** has the following features:
- Type: Causal Language Models
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
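Lengths beyond the native 32,768 tokens rely on YaRN rope scaling. A hedged sketch of the `config.json` override, following the pattern documented for Qwen3 base models (the scaling factor of 4.0 corresponds to roughly 131,072 tokens; verify against the upstream Qwen3 documentation before use):

```json
{
  ...,
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```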
Construct your prompt according to the following template.
Prompt Template:
```
System Prompt:
You are a code editor. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose.
User Prompt:
## Code Before:
{pre_edit_code}
## Instruction:
{instruction}
## Code After:
```
The following code snippet illustrates how to use the model to generate content from given inputs.
```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

def extract_first_python_block(text: str) -> str:
    """Return the contents of the first ```python fenced block, or "" if none."""
    pattern = r"```python\s*(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return ""

# Other checkpoints in the series:
# "zkzhang88/OpenCodeEdit-Qwen2.5-7B", "zkzhang88/OpenCodeEdit-DSC-6.7B"
model_name = "zkzhang88/OpenCodeEdit-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

pre_edit_code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

SYSTEM_PROMPT = (
    "You are a code editor. You will be provided the original code snippet "
    "and an instruction that specifies the changes you need to make. You will "
    "produce the changed code, based on the original code and the instruction "
    "given. Only produce the code, do not include any additional prose."
)
instruction = "Optimize the calculation method for the Fibonacci sequence by reducing recursive calls and employing dynamic programming to enhance efficiency."

formatted_input = f"""
## Code Before:
{pre_edit_code}
## Instruction:
{instruction}
## Code After:
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": formatted_input},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens, keeping only the newly generated completion
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(extract_first_python_block(content))
```
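Because the model may wrap its answer in a fenced ```python block, the `extract_first_python_block` helper strips the fence before printing. A quick standalone check of that helper (the sample string below is a hypothetical model output, not a real generation):

```python
import re

def extract_first_python_block(text: str) -> str:
    """Return the contents of the first ```python fenced block, or "" if none."""
    pattern = r"```python\s*(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return ""

# Hypothetical model output that wraps the edited code in a fenced block
sample = (
    "Here is the edited code:\n"
    "```python\n"
    "def fibonacci(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
    "```\n"
)
print(extract_first_python_block(sample))
```

If the output contains no fenced block, the helper returns an empty string, so you may want to fall back to the raw `content` in that case.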
## Citation
If you find our work helpful, please consider citing it.
```
@misc{zhang2025generatinghighqualitydatasetscode,
title={Generating High-Quality Datasets for Code Editing via Open-Source Language Models},
author={Zekai Zhang and Mingwei Liu and Zhenxi Chen and Linxi Liang and Yuxuan Chen and Guangsheng Ou and Yanlin Wang and Dan Li and Xin Peng and Zibin Zheng},
year={2025},
eprint={2509.25203},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2509.25203},
}
```