---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- llama-3
- astronomy
- astrophysics
- arxiv
inference: false
base_model:
- meta-llama/Llama-3-8b-hf
---

# AstroLLaMA-3-8B-Chat_Summary

AstroLLaMA-3-8B-Chat_Summary is a specialized chat model for astronomy, developed by the AstroMLab team by fine-tuning the AstroLLaMA-3-8B-Base_Summary model. It is designed for instruction-following and chat-based interactions in the astronomy domain.

## Model Details

- **Base Architecture**: LLaMA-3-8b
- **Base Model**: AstroLLaMA-3-8B-Base_Summary (trained on summarized content from arXiv's astro-ph category papers)
- **Data Processing** (a minimal pipeline sketch follows this list):
  1. Optical character recognition (OCR) on PDF files using the Nougat tool
  2. Summarization of the OCR'd text using Qwen-2-8B and LLaMA-3.1-8B, reducing each paper to about 1,000-4,000 tokens
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)
- **SFT Dataset**:
  - 10,356 astronomy-centered conversations generated from arXiv abstracts by GPT-4
  - Full content of the LIMA dataset
  - 10,000 samples from the Open Orca dataset
  - 10,000 samples from the UltraChat dataset
- **Training Details** (see the configuration sketch below):
  - Learning rate: 3 × 10⁻⁷
  - Training epochs: 1
  - Total batch size: 48
  - Maximum token length: 2048
  - Warmup ratio: 0.03
  - Cosine decay schedule for learning rate reduction
- **Primary Use**: Instruction-following and chat-based interactions for astronomy-related queries
- **Reference**: Pan et al. 2024 [Link to be added]
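
Below is a minimal sketch of the two-step data pipeline described above: Nougat OCR followed by LLM summarization. The `nougat` CLI and `.mmd` output come from the nougat-ocr package, but the summarizer model ID, prompt, and generation settings here are illustrative assumptions, not the authors' exact setup.

```python
import subprocess
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: OCR a paper PDF into Nougat's markdown-like .mmd format
subprocess.run(["nougat", "paper.pdf", "-o", "ocr_output"], check=True)
with open("ocr_output/paper.mmd") as f:
    paper_text = f.read()

# Step 2: compress the OCR'd text to roughly 1,000-4,000 tokens with an LLM
# (model ID and prompt are assumptions for illustration)
summarizer_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(summarizer_id)
model = AutoModelForCausalLM.from_pretrained(summarizer_id, device_map="auto")

prompt = f"Summarize the following astronomy paper, keeping key results:\n\n{paper_text}\n\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4000)
# Keep only the newly generated tokens (the summary)
summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```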
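
The reported hyperparameters map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows. This is a hedged sketch: the authors' actual training stack, per-device batch split, and precision settings are not stated in this card, so those values are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="astrollama-3-8b-chat_summary-sft",  # illustrative path
    learning_rate=3e-7,                 # reported learning rate
    num_train_epochs=1,                 # reported single epoch
    per_device_train_batch_size=6,      # assumed split: 8 GPUs x 6 = 48 total
    gradient_accumulation_steps=1,
    warmup_ratio=0.03,                  # reported warmup ratio
    lr_scheduler_type="cosine",         # reported cosine decay
    bf16=True,                          # assumption; precision not stated
)
# The 2048-token maximum length is enforced at tokenization time,
# not through TrainingArguments.
```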

## Using the model for chat

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-chat_summary")
model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-chat_summary", device_map="auto")

# Function to generate a response
def generate_response(prompt, max_new_tokens=512):
    full_prompt = f"###Human: {prompt}\n\n###Assistant:"
    # Truncate the prompt to the model's 2048-token training context
    inputs = tokenizer(full_prompt, return_tensors="pt", truncation=True, max_length=2048)
    inputs = inputs.to(model.device)

    # Generate a response; max_new_tokens bounds the reply length
    # independently of the prompt length
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_return_sequences=1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            # Stop when the model begins a new "###Human:" turn; the first
            # token of that marker acts as a makeshift stop token
            eos_token_id=tokenizer.encode("###Human:", add_special_tokens=False)[0]
        )

    # Decode the full sequence, then keep only the Assistant's reply
    response = tokenizer.decode(outputs[0], skip_special_tokens=False)
    assistant_response = response.split("###Assistant:")[-1].strip()
    return assistant_response

# Example usage
user_input = "What are the main components of a galaxy?"
response = generate_response(user_input)
print(f"Human: {user_input}")
print(f"Assistant: {response}")
```

## Model Improvements and Performance

This model was trained on summarized paper content, which improved performance over the AIC (Abstract, Introduction, Conclusion) version. Summarization allows more comprehensive information from each paper to be included while keeping the token count manageable.

Here is a performance comparison based on the astronomy benchmarking Q&A described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:

| Model | Score (%) |
|-------|-----------|
| AstroLLaMA-3-8B-Plus (AstroMLab) | 77.2 |
| LLaMA-3.1-8B | 73.7 |
| **<span style="color:green">AstroLLaMA-3-8B-Base_Summary (AstroMLab)</span>** | **<span style="color:green">72.3</span>** |
| LLaMA-3-8B | 72.0 |
| AstroLLaMA-3-8B-Base_AIC | 71.9 |
| **<span style="color:green">AstroLLaMA-3-8B-Chat_Summary (AstroMLab)</span>** | **<span style="color:green">71.5</span>** |
| Gemma-2-9B | 71.5 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| InternLM-2.5-7B | 64.0 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |

As shown, AstroLLaMA-3-8B-Chat_Summary performs competitively, retaining most of the performance of the base summary model. This demonstrates that the summarization approach captures key astronomical concepts and retains them even after fine-tuning for chat interactions.

We also found that training on summaries leads to better scores in general, especially for the instruct version, suggesting that information density matters significantly in specialized domain training.

While AstroLLaMA-3-8B-Chat_Summary performs well among models in its class, it does not surpass the base LLaMA-3.1-8B model. This underscores the ongoing challenges in developing specialized models and the need for continued research in this area.

This model is released primarily for reproducibility, allowing researchers to track the development process and compare different iterations of AstroLLaMA models.

For optimal performance and the most up-to-date capabilities in astronomy-related tasks, we recommend AstroLLaMA-3.1-8B-Plus, which incorporates training data beyond astro-ph and a greatly expanded fine-tuning process, resulting in significantly improved performance.

## Ethical Considerations

While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.

## Citation

If you use this model in your research, please cite:

```
[Citation for Pan et al. 2024 to be added]
```