## Introduction
OmniChem is a series of large language models specialized for chemistry, designed to address the critical challenge of model hallucination in scientific applications. This release provides the 7B instruction-tuned model with strong reasoning capabilities.
OmniChem brings the following key innovations:
- Systematic Hallucination Mitigation: Significantly mitigates model hallucination by internalizing physical constraints and structured reasoning patterns, reducing the generation of factually incorrect text.
- Expert-Level Chemistry Capabilities: Demonstrates high performance in core chemistry research tasks, including photophysical property modulation, physicochemical property optimization, and synthesis planning.
- Built on a Strong Foundation: Built upon Qwen2.5-7B-Instruct through continued pre-training on a 5-billion-token specialized corpus and fine-tuning on 199,589 QA pairs and 363,045 Chain-of-Thought (CoT) entries. The dataset is publicly available as OmniChem-563K (https://huggingface.co/datasets/Billy-Liu-DUT/OmniChem); see the loading sketch after this list.
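If you want to inspect the training data, it can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the repository ID is taken from the link above, while the split name and field layout are assumptions and should be checked on the dataset page.

```python
from datasets import load_dataset

# Load the OmniChem-563K dataset from the Hugging Face Hub
# (repository ID from the link above; split/field names are illustrative).
ds = load_dataset("Billy-Liu-DUT/OmniChem")

print(ds)              # list the available splits and columns
print(ds["train"][0])  # inspect the first example (assumes a "train" split)
```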
This repo contains the instruction-tuned 7B OmniChem model, which has the following features:
- Type: Causal Language Model, Specialized for Chemistry
- Training Stage: Continued Pre-training & Fine-tuning on Qwen2.5-7B-Instruct
- Architecture: Transformer with RoPE, SwiGLU, and RMSNorm
- Number of Parameters: 7B
- Number of Attention Heads (GQA): 28 for Q and 4 for KV (see the configuration check after this list)
- Context Length: 32,768 tokens natively; up to 128K tokens with YaRN (see Processing Long Texts)
- License: CC BY-NC-SA 4.0 (for academic, non-commercial use)
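The architectural details above can be verified directly from the model configuration. A minimal sketch; the field names follow the Qwen2 configuration used by the base model:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Billy-Liu-DUT/OmniChem-7B-v1")

# Grouped-query attention: 28 query heads sharing 4 key/value heads
print(config.num_attention_heads)       # expected: 28
print(config.num_key_value_heads)       # expected: 4
print(config.max_position_embeddings)   # native context length
```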
## Requirements
The code for OmniChem is compatible with the latest Hugging Face transformers library. We advise you to use version 4.40.0 or higher. Using older versions may result in unexpected errors.
```bash
pip install --upgrade transformers
```
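To confirm that the installed version meets the requirement, you can check it from Python:

```python
import transformers

# OmniChem requires transformers >= 4.40.0
print(transformers.__version__)
```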
## Quickstart
Here is a code snippet showing how to load the OmniChem model and tokenizer to generate content for a chemistry-related query.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # or torch.bfloat16 for better performance
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt for a chemistry task
prompt = "Plan a synthetic route for the small molecule drug lidocaine."
messages = [
    {"role": "system", "content": "You are a chemistry expert. Your task is to answer the user's problem using the most academic and rigorous professor-level language in a structured format. Think step by step."},
    {"role": "user", "content": prompt}
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Strip the prompt tokens and decode only the newly generated text
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
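For interactive use, you may prefer tokens to be printed as they are generated rather than waiting for the full completion. A minimal sketch using the transformers `TextStreamer` utility, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Print tokens as they are generated; skip the echoed prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```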
## Processing Long Texts
To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation. For supported frameworks, you can add the following to config.json to enable YaRN for contexts up to 128K tokens:
```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
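If you prefer not to edit config.json on disk, the same scaling can be applied when loading the model. This is a sketch assuming the standard transformers configuration API used by the Qwen2 architecture; the values mirror the JSON above.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

# Enable YaRN by setting rope_scaling on the loaded configuration.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```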
## License
This model is licensed under CC BY-NC-SA 4.0 for non-commercial use. Commercial use requires explicit permission. Contact [email protected] for inquiries.