coolAI committed · Commit 96e5bd4 · verified · Parent: c5df120

Update README.md

Files changed (1): README.md (+204 −6)
README.md CHANGED

@@ -8,14 +8,212 @@ tags:

- # Uploaded finetuned model
- - **Developed by:** coolAI
- - **License:** apache-2.0
- - **Finetuned from model:** unsloth/qwen2.5-vl-7b-instruct-unsloth-bnb-4bit
- This qwen2_5_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
license: apache-2.0
language:
- en
datasets:
- AI4Math/MathVista
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
- corto-ai/handwritten-text
---

# Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding

**Cernis-Thinking** is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.

## Model Details

- **Base Model**: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Training Method**: Group Relative Policy Optimization (GRPO) with GSPO extensions
- **Training Data**: ~2,000 samples across 4 document understanding tasks
- **Model Size**: 7B parameters
- **License**: Apache 2.0

## Capabilities

Cernis-Thinking is trained on four distinct document understanding tasks (illustrative prompts for each are sketched after the list):

1. **Mathematical Reasoning** - Solves math problems from images with step-by-step reasoning
2. **LaTeX OCR** - Converts images of mathematical notation to LaTeX code
3. **Invoice Extraction** - Extracts structured information from invoices and receipts
4. **Handwriting Transcription** - Transcribes handwritten text from images
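The exact training prompts are not published; the examples below simply follow the `<REASONING>`/`<SOLUTION>` format described in the training section and should be treated as illustrative starting points rather than the prompts used during fine-tuning:

```python
# Illustrative prompts only; adjust wording to your documents. All of them
# ask for the <REASONING>/<SOLUTION> structure the reward functions encourage.
EXAMPLE_PROMPTS = {
    "math": (
        "Solve the problem shown in the image. First provide your reasoning "
        "between <REASONING> and </REASONING>, then the final numeric answer "
        "between <SOLUTION> and </SOLUTION>."
    ),
    "latex_ocr": (
        "What is the LaTeX code shown in this image? "
        "Provide your answer between <SOLUTION> and </SOLUTION>."
    ),
    "invoice": (
        "Extract the key information from this invoice. First provide your "
        "reasoning between <REASONING> and </REASONING>, then your answer "
        "between <SOLUTION> and </SOLUTION>."
    ),
    "handwriting": (
        "Transcribe the handwritten text in this image. "
        "Provide the transcription between <SOLUTION> and </SOLUTION>."
    ),
}
```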
## Training Details

### Datasets

- [AI4Math/MathVista](https://huggingface.co/datasets/AI4Math/MathVista) - Mathematical reasoning (filtered for numeric answers)
- [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) - LaTeX formula recognition
- [mychen76/invoices-and-receipts_ocr_v1](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1) - Invoice extraction
- [corto-ai/handwritten-text](https://huggingface.co/datasets/corto-ai/handwritten-text) - Handwriting transcription
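The datasets can be pulled with the `datasets` library. The split names and the numeric-answer filter below are assumptions for illustration, not the exact preprocessing used for training:

```python
from datasets import load_dataset

# Split names are assumptions; check each dataset card for the splits it ships with.
mathvista   = load_dataset("AI4Math/MathVista", split="testmini")
latex_ocr   = load_dataset("unsloth/LaTeX_OCR", split="train")
invoices    = load_dataset("mychen76/invoices-and-receipts_ocr_v1", split="train")
handwriting = load_dataset("corto-ai/handwritten-text", split="train")

# Example of the kind of filtering mentioned above: keep MathVista items whose
# answer parses as a number (the "answer" field name is an assumption).
def has_numeric_answer(example):
    try:
        float(str(example.get("answer", "")).strip())
        return True
    except ValueError:
        return False

math_numeric = mathvista.filter(has_numeric_answer)
```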
### Reinforcement Learning Approach

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions; a simplified sketch of these rewards follows the descriptions below.

**1. Formatting Reward Function**
- Rewards proper use of `<REASONING>` and `<SOLUTION>` tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses

**2. Task-Specific Correctness Reward**
- **Math**: Exact numeric matching (2.0 points)
- **LaTeX/Handwriting**: String similarity with word overlap scoring (0.75-2.0 points)
- **Invoices**: Partial credit for extracting key information (1.5 points)

**3. ROUGE-like Word Overlap**
- For text-heavy tasks, rewards are based on the word overlap ratio:
  - Over 50% overlap: 1.5 points
  - Over 30% overlap: 0.75 points
- Prevents wasted training on completely wrong outputs
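The reward code itself is not shipped with the card, so the sketch below only mirrors the description above. It assumes TRL-style reward functions that receive plain-string completions plus `answer` and `task` columns from the dataset (both names are assumptions), and the formatting point values are illustrative:

```python
import re

SOLUTION_RE = re.compile(r"<SOLUTION>(.*?)</SOLUTION>", re.DOTALL)
REASONING_RE = re.compile(r"<REASONING>(.*?)</REASONING>", re.DOTALL)

def formatting_reward(completions, **kwargs):
    """Reward well-formed <REASONING>/<SOLUTION> outputs, penalize artifacts."""
    scores = []
    for text in completions:
        score = 0.0
        # Point values here are illustrative; the card does not specify them.
        if REASONING_RE.search(text):
            score += 0.5
        if SOLUTION_RE.search(text):
            score += 0.5
        # Penalize the degenerate "addCriterion"-style repetition mentioned above.
        if text.count("addCriterion") > 2:
            score -= 1.0
        scores.append(score)
    return scores

def word_overlap(prediction, reference):
    """Fraction of reference words that also appear in the prediction."""
    ref_words = set(reference.lower().split())
    pred_words = set(prediction.lower().split())
    return len(ref_words & pred_words) / max(len(ref_words), 1)

def correctness_reward(completions, answer, task, **kwargs):
    """Task-specific correctness: exact match for math, word overlap elsewhere."""
    scores = []
    for text, ref, kind in zip(completions, answer, task):
        match = SOLUTION_RE.search(text)
        pred = match.group(1).strip() if match else ""
        if kind == "math":
            # Simplified exact-match check standing in for numeric comparison.
            scores.append(2.0 if pred == str(ref).strip() else 0.0)
        else:
            overlap = word_overlap(pred, str(ref))
            scores.append(1.5 if overlap > 0.5 else 0.75 if overlap > 0.3 else 0.0)
    return scores
```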
### Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,
    max_prompt_length = 1024,
    max_completion_length = 1024,

    # GSPO settings
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
```
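For context, this config and the reward sketches above plug into TRL's `GRPOTrainer` roughly as follows. The `model`, `processor`, and `train_dataset` variables are assumed to come from your own loading code (for example Unsloth's vision model loader and the dataset sketch above), not from a released training script:

```python
from trl import GRPOTrainer

# `model`, `processor`, and `train_dataset` are assumed to already exist.
trainer = GRPOTrainer(
    model=model,
    processing_class=processor,  # tokenizer/processor for the VLM
    reward_funcs=[formatting_reward, correctness_reward],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```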

## Usage

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor (Qwen2.5-VL checkpoints use the Qwen2_5_VL* model class)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between <REASONING> and </REASONING>, then your answer between <SOLUTION> and </SOLUTION>"}
        ]
    }
]

# Build the chat-formatted prompt, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_text = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_text[0])
```
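Since the decoded text contains the prompt as well as both tag blocks, a small regex helper is a convenient way to pull out just the final answer:

```python
import re

def extract_solution(text: str) -> str | None:
    """Return the contents of the last <SOLUTION>...</SOLUTION> block, if any."""
    matches = re.findall(r"<SOLUTION>(.*?)</SOLUTION>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

print(extract_solution(generated_text[0]))
```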

### With vLLM (Recommended for Production)

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8
)

# Prompt in the Qwen2.5-VL chat format, including the image placeholder tokens
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>What is the LaTeX code shown in this image? Provide your answer between <SOLUTION> and </SOLUTION><|im_end|>\n<|im_start|>assistant\n"

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024
)

# Generate (pass the local image alongside the prompt)
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("formula.png")}
    },
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
```
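If you prefer not to hand-write the special tokens, the processor's chat template should produce an equivalent prompt string that can be passed to vLLM unchanged:

```python
from transformers import AutoProcessor

# Build the prompt string from messages instead of hand-writing special tokens.
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the LaTeX code shown in this image? Provide your answer between <SOLUTION> and </SOLUTION>"},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```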

## Example Outputs

### Mathematical Reasoning

**Input**: Image of a geometry problem

**Output**:
```
<REASONING>
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...
</REASONING>

<SOLUTION>
42
</SOLUTION>
```

### LaTeX OCR

**Input**: Image of a mathematical formula

**Output**:
```
<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>
```

### Invoice Extraction

**Input**: Invoice image

**Output**:
```
<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
</SOLUTION>
```

## Citation

```bibtex
@misc{cernis-thinking-2025,
  title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
  author={coolAI},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
```

## Acknowledgments

- Built with [Unsloth](https://github.com/unslothai/unsloth) for efficient VLM training
- Base model: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai

## License

Apache 2.0 - free for commercial and research use.