## Introduction
OmniChem is a series of large language models specialized for chemistry, designed to address the critical challenge of model hallucination in scientific applications. This release provides the 7B instruction-tuned model with strong reasoning capabilities.
OmniChem brings the following key innovations:
- Systematic Hallucination Mitigation: Significantly mitigates model hallucination by internalizing physical constraints and structured reasoning patterns, reducing the generation of factually incorrect text.
- Expert-Level Chemistry Capabilities: Demonstrates high performance in core chemistry research tasks, including photophysical property modulation, physicochemical property optimization, and synthesis planning.
- Built on a Strong Foundation: Built upon Qwen2.5-7B-Instruct through continued pre-training on a 5-billion-token specialized corpus and fine-tuning on 199,589 QA pairs and 363,045 Chain-of-Thought (CoT) entries. The dataset is publicly available as OmniChem-563K (https://huggingface.co/datasets/Billy-Liu-DUT/OmniChem); see the loading sketch after this list.
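If you want to inspect the training data, it can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the repository ID is taken from the link above, while the split name and field layout are assumptions and should be checked on the dataset page.

```python
from datasets import load_dataset

# Load the OmniChem-563K dataset from the Hugging Face Hub
# (repository ID from the link above; split/field names are illustrative).
ds = load_dataset("Billy-Liu-DUT/OmniChem")

print(ds)              # list the available splits and columns
print(ds["train"][0])  # inspect the first example (assumes a "train" split)
```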
This repo contains the instruction-tuned 7B OmniChem model, which has the following features:
- Type: Causal Language Model, Specialized for Chemistry
- Training Stage: Continued Pre-training & Fine-tuning on Qwen2.5-7B-Instruct
- Architecture: Transformer with RoPE, SwiGLU, and RMSNorm
- Number of Parameters: 7B
- Number of Attention Heads (GQA): 28 for Q and 4 for KV (see the configuration check after this list)
- Context Length: 32,768 tokens natively; up to 128K tokens with YaRN (see Processing Long Texts)
- License: CC BY-NC-SA 4.0 (for academic, non-commercial use)
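The architectural details above can be verified directly from the model configuration. A minimal sketch; the field names follow the Qwen2 configuration used by the base model:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Billy-Liu-DUT/OmniChem-7B-v1")

# Grouped-query attention: 28 query heads sharing 4 key/value heads
print(config.num_attention_heads)       # expected: 28
print(config.num_key_value_heads)       # expected: 4
print(config.max_position_embeddings)   # native context length
```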
## Requirements
The code for OmniChem is compatible with the latest Hugging Face transformers library. We advise you to use version 4.40.0 or higher. Using older versions may result in unexpected errors.
```bash
pip install --upgrade transformers
```
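To confirm that the installed version meets the requirement, you can check it from Python:

```python
import transformers

# OmniChem requires transformers >= 4.40.0
print(transformers.__version__)
```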
## Quickstart
Here is a code snippet showing how to load the OmniChem model and tokenizer to generate content for a chemistry-related query.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # or torch.bfloat16 for better performance
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt for a chemistry task
prompt = "Plan a synthetic route for the small molecule drug lidocaine."
messages = [
    {"role": "system", "content": "You are a chemistry expert. Your task is to answer the user's problem using the most academic and rigorous professor-level language in a structured format. Think step by step."},
    {"role": "user", "content": prompt}
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Strip the prompt tokens and decode only the newly generated text
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
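For interactive use, you may prefer tokens to be printed as they are generated rather than waiting for the full completion. A minimal sketch using the transformers `TextStreamer` utility, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Print tokens as they are generated; skip the echoed prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```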
## Processing Long Texts
To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation. For supported frameworks, you can add the following to config.json to enable YaRN for contexts up to 128K tokens:
```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```
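If you prefer not to edit config.json on disk, the same scaling can be applied when loading the model. This is a sketch assuming the standard transformers configuration API used by the Qwen2 architecture; the values mirror the JSON above.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

# Enable YaRN by setting rope_scaling on the loaded configuration.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```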
## License
This model is licensed under CC BY-NC-SA 4.0 for non-commercial use. Commercial use requires explicit permission. Contact [email protected] for inquiries.