---
license: cc-by-nc-sa-4.0
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
tags:
- chemistry
---

### Introduction

OmniChem is a new series of large language models specialized for the domain of chemistry. It is designed to address the critical challenge of model hallucination in scientific applications. In this release, we provide a 7B instruction-tuned model with strong reasoning capabilities.

OmniChem brings the following key innovations:

* **Systematic Hallucination Mitigation**: Significantly mitigates model hallucination by internalizing physical constraints and structured reasoning patterns, reducing the generation of factually incorrect text.
* **Expert-Level Chemistry Capabilities**: Demonstrates high performance in core chemistry research tasks, including **photophysical property modulation**, **physicochemical property optimization**, and **synthesis planning**.
* **Built on a Strong Foundation**: Built upon **Qwen2.5-7B-Instruct** through continued pre-training on a **5-billion-token specialized corpus** and fine-tuning on **199,589 QA pairs** and **363,045 Chain-of-Thought (CoT) entries**. The dataset is publicly available as [OmniChem-563K](https://huggingface.co/datasets/Billy-Liu-DUT/OmniChem).

This repo contains the instruction-tuned 7B OmniChem model, which has the following features:

* **Type**: Causal Language Model, Specialized for Chemistry
* **Training Stage**: Continued Pre-training & Fine-tuning on Qwen2.5-7B-Instruct
* **Architecture**: Transformer with RoPE, SwiGLU, and RMSNorm
* **Number of Parameters**: 7B
* **Number of Attention Heads (GQA)**: 28 for Q and 4 for KV
* **Context Length**: Supports up to 128K tokens
* **License**: CC BY-NC-SA 4.0 (for academic, non-commercial use)

### Requirements

The code for OmniChem is compatible with the latest Hugging Face `transformers` library. We advise you to use version `4.40.0` or higher; older versions may result in unexpected errors.

```bash
pip install --upgrade transformers
```

### Quickstart

Here is a code snippet showing how to load the OmniChem model and tokenizer and generate a response to a chemistry-related query.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # or torch.bfloat16 on supported GPUs
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt for a chemistry task
prompt = "Plan a synthetic route for the small molecule drug lidocaine."
messages = [
    {"role": "system", "content": "You are a chemistry expert. Your task is to answer the user's problem using the most academic and rigorous professor-level language in a structured format. Think step by step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens so only the newly generated reply is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```
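For interactive use, it is often more convenient to see the answer as it is being written rather than after `generate` finishes. The sketch below reuses `model`, `tokenizer`, and `model_inputs` from the Quickstart above and attaches the `TextStreamer` utility from `transformers`; the generation parameters are illustrative and not specifically tuned for OmniChem.

```python
from transformers import TextStreamer

# Print decoded text to stdout as tokens are generated;
# skip the prompt and special tokens so only the model's reply appears.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```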
### Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation.

For supported frameworks, you can add the following to `config.json` to enable YaRN for contexts up to 128K tokens (a programmatic alternative is sketched at the end of this card):

```json
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

### License

This model is licensed under CC BY-NC-SA 4.0 for non-commercial use. Commercial use requires explicit permission; contact [liubilly@mail.dlut.edu.cn](mailto:liubilly@mail.dlut.edu.cn) for inquiries.
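As a programmatic alternative to editing `config.json` by hand (see *Processing Long Texts* above), the same YaRN settings can be attached to the model configuration at load time. This is a minimal sketch, assuming a `transformers` version whose Qwen2 implementation honors the `rope_scaling` field; it is not an officially documented interface for this checkpoint, so verify the effective context length on your own setup.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "Billy-Liu-DUT/OmniChem-7B-v1"

# Load the configuration and attach the YaRN rope-scaling settings
# described in the "Processing Long Texts" section.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```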