---
base_model: unsloth/qwen2.5-vl-7b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2_5_vl
license: apache-2.0
language:
- en
datasets:
- AI4Math/MathVista
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
- corto-ai/handwritten-text
---

# Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding

**Cernis-Thinking** is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.

## Model Details

- **Base Model**: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Training Method**: Group Relative Policy Optimization (GRPO) with GSPO extensions
- **Training Data**: ~2,000 samples across 4 document understanding tasks
- **Model Size**: 7B parameters
- **License**: Apache 2.0

## Capabilities

Cernis-Thinking is trained on four distinct document understanding tasks:

1. **Mathematical Reasoning** - Solves math problems from images with step-by-step reasoning
2. **LaTeX OCR** - Converts mathematical notation images to LaTeX code
3. **Invoice Extraction** - Extracts structured information from invoices and receipts
4. **Handwriting Transcription** - Transcribes handwritten text from images

## Training Details

### Datasets

- [AI4Math/MathVista](https://huggingface.co/datasets/AI4Math/MathVista) - Mathematical reasoning (filtered for numeric answers)
- [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) - LaTeX formula recognition
- [mychen76/invoices-and-receipts_ocr_v1](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1) - Invoice extraction
- [corto-ai/handwritten-text](https://huggingface.co/datasets/corto-ai/handwritten-text) - Handwriting transcription

### Reinforcement Learning Approach

The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions:

**1. Formatting Reward Function**

- Rewards proper use of `` and `` tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses

**2. Task-Specific Correctness Reward**

- **Math**: Exact numeric matching (2.0 points)
- **LaTeX/Handwriting**: String similarity with word overlap scoring (0.75-2.0 points)
- **Invoices**: Partial credit for extracting key information (1.5 points)

**3.
ROUGE-like Word Overlap**

- For text-heavy tasks, rewards are based on the word overlap ratio:
  - Overlap >50%: 1.5 points
  - Overlap >30%: 0.75 points
- Prevents wasted training on completely wrong outputs

### Training Configuration

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    # GSPO settings
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
```

## Usage

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor (Qwen2.5-VL checkpoints use the Qwen2_5_VL model class)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between and , then your answer between and "}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate
output_ids = model.generate(**inputs, max_new_tokens=1024)

# Drop the prompt tokens so only the newly generated text is decoded
completion_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(completion_ids, skip_special_tokens=True)
print(generated_text[0])
```

### With vLLM (Recommended for Production)

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8
)

# Prepare prompt (Qwen chat format with a single image placeholder)
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "What is the LaTeX code shown in this image? "
    "Provide your answer between and <|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024
)

# Load the local image with PIL; vLLM's ImageAsset only resolves its bundled sample images
image = Image.open("formula.png")

# Generate
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image}
    },
    sampling_params=sampling_params
)
print(outputs[0].outputs[0].text)
```

## Example Outputs

### Mathematical Reasoning

**Input**: Image of geometry problem

**Output**:
```
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...

42
```

### LaTeX OCR

**Input**: Image of mathematical formula

**Output**:
```
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
```

### Invoice Extraction

**Input**: Invoice image

**Output**:
```
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
```

## Citation

```bibtex
@misc{cernis-thinking-2025,
  title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
  author={Your Name},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
```

## Acknowledgments

- Built with [Unsloth](https://github.com/unslothai/unsloth) for efficient VLM training
- Base model: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai

## License

Apache 2.0 - Free for commercial and research use
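
## Appendix: Reward Scoring Sketch

The scoring tiers listed under "Reinforcement Learning Approach" could be implemented roughly as follows. This is a minimal sketch, not the training code used for this model: the function names `word_overlap_reward` and `math_reward` are hypothetical, and the overlap ratio is assumed here to be the share of unique reference words that also appear in the prediction, which the card does not specify.

```python
def word_overlap_reward(prediction: str, reference: str) -> float:
    """Tiered reward from word overlap (assumed definition: the share of
    unique reference words that also occur in the prediction)."""
    ref_words = set(reference.lower().split())
    pred_words = set(prediction.lower().split())
    if not ref_words:
        return 0.0  # nothing to compare against
    overlap = len(ref_words & pred_words) / len(ref_words)
    if overlap > 0.5:
        return 1.5   # >50% overlap tier from the card
    if overlap > 0.3:
        return 0.75  # >30% overlap tier from the card
    return 0.0       # no reward for mostly wrong outputs


def math_reward(prediction: str, reference: str) -> float:
    """Exact numeric match earns 2.0 points; anything else earns 0.0."""
    try:
        return 2.0 if float(prediction.strip()) == float(reference.strip()) else 0.0
    except ValueError:
        return 0.0  # prediction or reference is not a parseable number


if __name__ == "__main__":
    print(word_overlap_reward("total due 247.50 usd", "total due 247.50"))
    print(math_reward("42", "42.0"))
```

Comparing sets of lowercase words keeps the sketch simple but ignores word order and repetition; a ROUGE-style n-gram score would be stricter.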