---
license: cc-by-nc-sa-4.0
---

# Introduction

OmniChem is a new series of large language models specialized for the domain of chemistry, designed to address the critical challenge of model hallucination in scientific applications. For OmniChem, we release this instruction-tuned 7B model with strong reasoning capabilities.

OmniChem brings the following key innovations:

- **Systematic Hallucination Mitigation:** Significantly mitigates model hallucination by internalizing physical constraints and structured reasoning patterns, reducing the generation of factually incorrect text.
- **Expert-Level Chemistry Capabilities:** Demonstrates high performance in core chemistry research tasks, including photophysical property modulation, physicochemical property optimization, and synthesis planning.
- **Built on a Strong Foundation:** Built upon Qwen2.5-7B-Instruct through continued pre-training on a 5-billion-token specialized corpus, followed by fine-tuning on 170K QA pairs and 420K Chain-of-Thought (CoT) entries.

This repo contains the instruction-tuned 7B OmniChem model, which has the following features:

- **Type:** Causal language model, specialized for chemistry
- **Training Stage:** Continued pre-training & fine-tuning on Qwen2.5-7B-Instruct
- **Architecture:** Transformer with RoPE, SwiGLU, and RMSNorm
- **Number of Parameters:** 7B
- **Number of Attention Heads (GQA):** 28 for Q and 4 for KV (see the configuration check below)
- **Context Length:** Up to 128K tokens (via YaRN; see "Processing Long Texts")
- **License:** CC BY-NC-SA 4.0 (for academic, non-commercial use)
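
As a quick sanity check, the head counts and context length above can be read straight from the model configuration. A minimal sketch, assuming the placeholder Hub path used in the Quickstart below:

```python
from transformers import AutoConfig

# "YourUsername/OmniChem-7B-v1" is a placeholder; substitute the real repo id
config = AutoConfig.from_pretrained("YourUsername/OmniChem-7B-v1")
print(config.num_attention_heads)      # expected: 28 (query heads)
print(config.num_key_value_heads)      # expected: 4 (key/value heads under GQA)
print(config.max_position_embeddings)  # native context window before YaRN scaling
```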

For more details, please refer to our [Paper (Link to your paper)] and [Project GitHub (Link to your GitHub)].

# Requirements

The code for OmniChem is compatible with the latest Hugging Face `transformers` library. We advise using version 4.40.0 or higher; older versions may produce unexpected errors.

```bash
pip install --upgrade transformers
```
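
To confirm that the installed version meets this requirement:

```python
import transformers

# Should print 4.40.0 or higher
print(transformers.__version__)
```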

# Quickstart

Here is a code snippet showing how to load the OmniChem model and tokenizer and generate a response for a chemistry-related query.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Replace "YourUsername/OmniChem-7B-v1" with the actual model path on the Hub
model_name = "YourUsername/OmniChem-7B-v1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # or torch.bfloat16 for better performance on supported hardware
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt for a chemistry task
prompt = "Plan a synthetic route for the small molecule drug lidocaine."
messages = [
    {"role": "system", "content": "You are OmniChem, a highly reliable AI assistant for chemistry research. Provide accurate, fact-based, and logical responses."},
    {"role": "user", "content": prompt},
]

# Apply the chat template to build the full prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
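
For interactive use, you may prefer to stream tokens to the terminal as they are generated. A minimal sketch using the `TextStreamer` utility from `transformers`, reusing `model`, `tokenizer`, and `model_inputs` from the snippet above:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    streamer=streamer,
)
```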

# Processing Long Texts

To handle inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation. For supported frameworks, you can add the following to `config.json` to enable YaRN for contexts up to 128K tokens:

```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

We advise adding the `rope_scaling` configuration only when processing long contexts is required, as it may impact performance on shorter texts.
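
If you prefer not to edit `config.json` on disk, an equivalent override can be applied in `transformers` by modifying the config object before loading the weights. A sketch, assuming the placeholder path from the Quickstart:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "YourUsername/OmniChem-7B-v1"  # placeholder; substitute the real repo id

config = AutoConfig.from_pretrained(model_name)
# Same YaRN settings as the config.json snippet above
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```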