---
license: cc-by-nc-sa-4.0
---

# Introduction

OmniChem is a new series of large language models specialized for the domain of chemistry, designed to address the critical challenge of model hallucination in scientific applications. For OmniChem, we release this instruction-tuned 7B model with strong reasoning capabilities.

OmniChem brings the following key innovations:

- **Systematic Hallucination Mitigation:** Significantly mitigates model hallucination by internalizing physical constraints and structured reasoning patterns, reducing the generation of factually incorrect text.
- **Expert-Level Chemistry Capabilities:** Demonstrates high performance in core chemistry research tasks, including photophysical property modulation, physicochemical property optimization, and synthesis planning.
- **Built on a Strong Foundation:** Built upon Qwen2.5-7B-Instruct through continued pre-training on a 5-billion-token specialized corpus, followed by fine-tuning on 170K QA pairs and 420K Chain-of-Thought (CoT) entries.

This repo contains the instruction-tuned 7B OmniChem model, which has the following features:

- **Type:** Causal language model, specialized for chemistry
- **Training Stage:** Continued pre-training & fine-tuning on Qwen2.5-7B-Instruct
- **Architecture:** Transformer with RoPE, SwiGLU, and RMSNorm
- **Number of Parameters:** 7B
- **Number of Attention Heads (GQA):** 28 for Q and 4 for KV (see the configuration check below)
- **Context Length:** Up to 128K tokens (via YaRN; see "Processing Long Texts")
- **License:** CC BY-NC-SA 4.0 (for academic, non-commercial use)
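
As a quick sanity check, the head counts and context length above can be read straight from the model configuration. A minimal sketch, assuming the placeholder Hub path used in the Quickstart below:

```python
from transformers import AutoConfig

# "YourUsername/OmniChem-7B-v1" is a placeholder; substitute the real repo id
config = AutoConfig.from_pretrained("YourUsername/OmniChem-7B-v1")
print(config.num_attention_heads)      # expected: 28 (query heads)
print(config.num_key_value_heads)      # expected: 4 (key/value heads under GQA)
print(config.max_position_embeddings)  # native context window before YaRN scaling
```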

For more details, please refer to our [Paper (Link to your paper)] and [Project GitHub (Link to your GitHub)].

# Requirements

The code for OmniChem is compatible with the latest Hugging Face `transformers` library. We advise using version 4.40.0 or higher; older versions may produce unexpected errors.

```bash
pip install --upgrade transformers
```
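
To confirm that the installed version meets this requirement:

```python
import transformers

# Should print 4.40.0 or higher
print(transformers.__version__)
```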

# Quickstart

Here is a code snippet showing how to load the OmniChem model and tokenizer and generate a response for a chemistry-related query.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Replace "YourUsername/OmniChem-7B-v1" with the actual model path on the Hub
model_name = "YourUsername/OmniChem-7B-v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # or torch.bfloat16 for better performance on supported hardware
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt for a chemistry task
prompt = "Plan a synthetic route for the small molecule drug lidocaine."
messages = [
    {"role": "system", "content": "You are OmniChem, a highly reliable AI assistant for chemistry research. Provide accurate, fact-based, and logical responses."},
    {"role": "user", "content": prompt},
]

# Apply the chat template to build the full prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
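
For interactive use, you may prefer to stream tokens to the terminal as they are generated. A minimal sketch using the `TextStreamer` utility from `transformers`, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```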

# Processing Long Texts

To handle inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation. For supported frameworks, you can add the following to `config.json` to enable YaRN for contexts up to 128K tokens:

```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

We advise adding the `rope_scaling` configuration only when processing long contexts is required, as it may impact performance on shorter texts.
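
If you prefer not to edit `config.json` on disk, an equivalent override can be applied in `transformers` by modifying the config object before loading the weights. A sketch, assuming the placeholder path from the Quickstart:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "YourUsername/OmniChem-7B-v1"  # placeholder; substitute the real repo id

config = AutoConfig.from_pretrained(model_name)
# Same YaRN settings as the config.json snippet above
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```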