chart-rvr-3b / README.md

Upload folder using huggingface_hub

423bda7 verified 4 months ago

6.05 kB

	---
	base_model: Qwen/Qwen2.5-VL-3B-Instruct
	library_name: transformers
	model_name: Chart-RVR-3B
	tags:
	- generated_from_trainer
	- grpo
	- trl
	licence: license
	---


	# Model Card for Chart-RVR-3B

	We present the first RL-trained chart-specific explainable model with SOTA performance on OOD datasets.
	The model is trained to predict chart type, data table, CoT reasoning and the final answer.

	# Chart-RVR Inference Demo

	This script demonstrates how to use the [`sanchit97/chart-rvr-3b`](https://huggingface.co/sanchit97/chart-rvr-3b) model for chart-based reasoning using a vision-language interface. It loads a chart image from a URL, prompts the model with a question, and extracts the structured reasoning and final answer.

	## 🧠 System Prompt

	The assistant is instructed to return a structured response using `<think>` and `<answer>` tags. Inside `<think>`, it outputs the chart type, the data table in JSON, and reasoning steps.

	## 🧪 Inference Code

	```python
	from transformers import AutoProcessor, AutoModelForVision2Seq
	from PIL import Image
	import requests
	from io import BytesIO
	import torch
	from qwen_vl_utils import process_vision_info # helper from Qwen repo

	# Load processor and model
	processor = AutoProcessor.from_pretrained("sanchit97/chart-rvr-3b")
	model = AutoModelForVision2Seq.from_pretrained(
	"sanchit97/chart-rvr-3b", device_map="auto", torch_dtype=torch.bfloat16
	)

	# Define the structured system prompt
	SYSTEM_PROMPT = """
	You are a vision-language assistant. You are given a chart image and a query about the chart.
	Think step-by-step about how to answer the query based on the chart image and then provide the final answer.

	### Output format
	Respond with exactly two blocks in order and nothing else:
	<think>
	First output the type of chart in <type>, \
	then output the underlying data table and finally, \
	think step-by-step about how to answer the query based on the chart image \
	and then provide the final answer.
	<type>
	Type of chart - one word from line, bar, stacked bar, pie, histogram, scatterplot, area, stacked area, bubble, treemap.
	</type>
	Next output the data table in the <table></table> tags
	<table>
	json table - for the chart image, output only a JSON object with: "columns": list of column headers, "rows": list-of-lists, one per data row
	No prose, no comments.
	1. Respond with only a JSON object
	2. The JSON must use exactly this schema:
	{
	"columns": [...],
	"rows": [[...], [...],..., [...]]
	}
	3. Do NOT output HTML, Markdown, or commentary. Any deviation gets zero reward.
	</table>
	Provide your reasoning here in steps:
	<step-1>: Provide a description of reasoning
	<step-2>: Gather ALL the appropriate data from the chart
	<step-3>: Break down the query into smaller parts and verify each part with the data
	...
	<step-n>: Do the final calculation or reasoning to derive the answer
	</think>
	<answer>
	Final answer on a single line
	</answer>
	"""

	# Chart image from URL
	image_url = "https://mathmonks.com/wp-content/uploads/2023/01/Parts-Bar-Graph.jpg"
	response = requests.get(image_url)
	image = Image.open(BytesIO(response.content)).convert("RGB")

	# Query about the chart
	prompt = "What is the average of all the bars in the chart?"

	# Build multimodal chat input
	messages = [
	{
	"role": "system",
	"content": SYSTEM_PROMPT
	},
	{
	"role": "user",
	"content": [
	{"type": "image", "image": image},
	{"type": "text", "text": prompt},
	],
	},
	]

	# Format text and vision input
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, video_inputs = process_vision_info(messages)

	inputs = processor(
	text=text,
	images=[image_inputs],
	videos=video_inputs,
	padding=True,
	return_tensors="pt",
	).to(model.device)

	# Generate output
	generated_ids = model.generate(**inputs, max_new_tokens=1024)
	generated_ids_trimmed = [
	out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output = processor.batch_decode(
	generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)[0]

	print("Generated Output: ", output)
	print("Answer: ", output.split("<answer>")[-1].split("</answer>")[0].strip())
	```




	## Benchmark numbers

	Some Raw numbers (more numbers+paper coming soon!):
	1. ChartQA: 84.56
	2. PlotQA: 78.68
	3. ChartFC: 77.62
	4. EvoChart: 53.36
	5. ChartQAPro: 28.38


	This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).
	It has been trained using [TRL](https://github.com/huggingface/trl).

	## Training procedure

	<!-- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/chartrl/chartrl/runs/nzaj1tyg) -->


	This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).

	### Framework versions

	- TRL: 0.20.0.dev0
	- Transformers: 4.53.2
	- Pytorch: 2.6.0+cu124
	- Datasets: 3.6.0
	- Tokenizers: 0.21.1

	## Citations

	Cite GRPO as:

	```bibtex
	@article{zhihong2024deepseekmath,
	title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
	author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
	year = 2024,
	eprint = {arXiv:2402.03300},
	}

	```

	Cite TRL as:

	```bibtex
	@misc{vonwerra2022trl,
	title = {{TRL: Transformer Reinforcement Learning}},
	author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
	year = 2020,
	journal = {GitHub repository},
	publisher = {GitHub},
	howpublished = {\url{https://github.com/huggingface/trl}}
	}
	```