---
base_model:
- Qwen/Qwen2-VL-7B-Instruct
language:
- zh
- en
library_name: transformers
license: apache-2.0
metrics:
- bertscore
- bleu
pipeline_tag: image-text-to-text
tags:
- medical
---

# EchoVLM (paper implementation)

Official PyTorch implementation of the model described in **"[EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence](https://arxiv.org/abs/2509.14977)"**.

## 🤖 Model Details

| Item      | Value                                                |
|-----------|------------------------------------------------------|
| Paper     | [arXiv:2509.14977](https://arxiv.org/abs/2509.14977) |
| Authors   | Chaoyin She, Ruifang Lu, et al.                      |
| Code      | [GitHub repo](https://github.com/Asunatan/EchoVLM)   |
| Model Hub | [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM) |

## 🔄 Updates

- **Sep 19, 2025**: Released model weights on [Hugging Face](https://huggingface.co/chaoyinshe/EchoVLM).
- **Sep 17, 2025**: Paper published on [arXiv](https://arxiv.org/abs/2509.14977).
- **Coming soon**: V2 with Chain-of-Thought reasoning and reinforcement learning enhancements.

## 🚀 Quick Start

### Using 🤗 Transformers to Chat

Here is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils` (installable via `pip install qwen-vl-utils`):

```python
from transformers import Qwen2VLMOEForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# ===== 1. Load model & processor =====
model = Qwen2VLMOEForConditionalGeneration.from_pretrained(
    "chaoyinshe/EchoVLM",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # faster & memory-efficient
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("chaoyinshe/EchoVLM")

# The default range for the number of visual tokens per image is 4-16384.
# You can set min_pixels and max_pixels according to your needs, e.g. a token
# range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("chaoyinshe/EchoVLM", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/ultrasound_image.jpg",  # local path or URL of an ultrasound image
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
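If FlashAttention-2 is not installed in your environment, the model can still be loaded with another attention backend. The snippet below is a minimal sketch, assuming a recent `transformers` release where `attn_implementation="sdpa"` (PyTorch scaled-dot-product attention) is available; you can also simply omit the argument to use the library default.

```python
# Sketch: loading without FlashAttention-2 (assumes flash-attn is unavailable).
# Omitting attn_implementation entirely falls back to the transformers default.
model = Qwen2VLMOEForConditionalGeneration.from_pretrained(
    "chaoyinshe/EchoVLM",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```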
#### Multi-image inference

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/ultrasound_image1.jpg"},
            {"type": "image", "image": "file:///path/to/ultrasound_image2.jpg"},
            {"type": "text", "text": "帮我给出超声报告"},  # "Please write the ultrasound report for me."
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
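The prepare → generate → decode steps are identical across the examples on this card. As an optional convenience (not part of the original snippets), they can be wrapped in a small helper; the `chat` name and its default below are illustrative, and it reuses the `model` and `processor` loaded in the Quick Start.

```python
# Illustrative helper (hypothetical name): runs one conversation end to end.
def chat(messages, max_new_tokens=128):
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

# Usage with the multi-image messages defined above:
# print(chat(messages))
```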
#### Batch inference

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "This patient has a hypoechoic nodule in the left breast. What is the next step in treatment?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
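For interactive use, the generated report can also be printed token by token as it is produced. Below is a minimal sketch using the `TextStreamer` utility from `transformers`; it assumes the single-conversation `inputs` from the earlier examples (streaming does not support batched inputs).

```python
from transformers import TextStreamer

# Stream tokens to stdout as they are generated (single conversation only).
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=128, streamer=streamer)
```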
## 📌 Citation

If you use this model or code in your research, please cite:

```bibtex
@misc{she2025echovlmdynamicmixtureofexpertsvisionlanguage,
  title={EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence},
  author={Chaoyin She and Ruifang Lu and Lida Chen and Wei Wang and Qinghua Huang},
  year={2025},
  eprint={2509.14977},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.14977},
}
```