# T³: Test-Time Model Merging for Medical Vision-Language Models

[Raza Imam](https://razaimam45.github.io/), Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
*Mohamed bin Zayed University of Artificial Intelligence*

[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE) [![Paper](https://img.shields.io/badge/Paper-ArXiV-red)](https://arxiv.org/abs/2510.27265) [![Weights](https://img.shields.io/badge/Weights-HuggingFace-yellow)](https://huggingface.co/razaimam45/TCube_Merging)

This repository provides the official PyTorch implementation of our T³ medical model-merging paper:

![T³ Workflow](figures/method.png)
*Figure 1: Dynamic test-time merging workflow of T³.*

Official implementation of **T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging**, a method for adaptive fusion of pretrained and fine-tuned vision-language models at test time using Jensen–Shannon divergence.

---

## Key Features

- 🧠 **Mutual Information Guidance**: Uses Jensen–Shannon divergence to measure model consensus.
- ⚡ **Backpropagation-Free**: No gradient updates required during inference.
- 🏥 **Medical Modality Agnostic**: Validated consistently across four medical imaging domains.
- 🚀 **Batch-Wise Efficiency**: Reduces compute cost by 32× versus sample-wise merging.
- 📈 **SOTA Performance**: Outperforms 8+ baselines in accuracy and robustness.

---

## Table of Contents

- [Installation](#installation)
- [Method Overview](#method-overview)
- [Folder Structure](#folder-structure)
- [Reproducing Results](#reproducing-results)
- [Pretrained Weights](#pretrained-weights)
- [Datasets](#datasets)
- [Citation](#citation)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Razaimam45/TCube.git T3
   cd T3
   ```

2. Create the conda environment:

   ```bash
   conda create -n t3 python=3.9
   conda activate t3
   pip install -r requirements.txt
   ```

## Method Overview

### Adaptive Merging via Jensen–Shannon Divergence

The interpolation coefficient λ is computed dynamically for each sample using the following equation:

```math
\lambda(x) = \lambda_{\min} + (\lambda_{\max} - \lambda_{\min}) \cdot \sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \,\Vert\, p_{ft}(x))\bigr)
```

Where:
- `JS` = Jensen–Shannon divergence between the pretrained and fine-tuned models' predictions.
- `σ` = sigmoid function for smooth scaling.
- `γ` = scaling factor (default = 0.5).
- `λ_min`, `λ_max` = lower and upper bounds of the interpolation coefficient.

### Visual Explanation of the Method

The following points justify the method and its effectiveness:

1. **Dynamic Weighting Based on Model Agreement**

   We propose using Jensen–Shannon (JS) divergence to measure mutual information between the pretrained (`p_pt`) and fine-tuned (`p_ft`) model predictions, offering a more robust gauge of joint confidence than entropy-based methods such as DaWin's entropy ratio:

   ```math
   R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}
   ```

   JS divergence explicitly captures agreement vs. disagreement by comparing the full predictive distributions:

   ```math
   I(x) = \frac{1}{2} \Bigl(\mathrm{KL}\bigl(p_{pt}(x) \,\Vert\, \bar{p}(x)\bigr) + \mathrm{KL}\bigl(p_{ft}(x) \,\Vert\, \bar{p}(x)\bigr)\Bigr)
   ```

   where

   ```math
   \bar{p}(x) = \tfrac{1}{2}\bigl(p_{pt}(x) + p_{ft}(x)\bigr)
   ```

   This ensures:
   - $I(x) = 0$ when the models fully agree.
   - $I(x) > 0$ when confident predictions disagree.

   Empirically, $I(x)$ correlates positively with $R(x)$ but better distinguishes disagreements, validating its use for adaptive merging.

2. **Mutual Information vs. Entropy**

   ![MI vs Entropy](figures/mi_v_ent.png)
   *Figure 3: Relationship between mutual information and entropy for adaptive merging.*

3. **Performance Across Modalities**

   ![Performance Comparison](figures/results.png)
   *Figure 4: T³ achieves superior performance across multiple medical imaging modalities.*
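To make the merging rule concrete, below is a minimal PyTorch sketch of the computation above. It is illustrative, not the repository's implementation: the function names (`js_divergence`, `adaptive_lambda`, `merge_state_dicts`), the `λ_min`/`λ_max` defaults, and the convention that λ weights the fine-tuned checkpoint are our own assumptions; only γ = 0.5 comes from the equations above. See `t_cube.py` for the actual code.

```python
import torch

def js_divergence(p_pt: torch.Tensor, p_ft: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two batches of predictive distributions.

    p_pt, p_ft: (batch, num_classes) probabilities (rows sum to 1).
    Returns a (batch,) tensor that is 0 when the two models fully agree.
    """
    p_bar = 0.5 * (p_pt + p_ft)
    log_bar = p_bar.clamp_min(1e-12).log()  # clamp avoids log(0)
    kl_pt = (p_pt * (p_pt.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    kl_ft = (p_ft * (p_ft.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    return 0.5 * (kl_pt + kl_ft)

def adaptive_lambda(p_pt, p_ft, lam_min=0.0, lam_max=1.0, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS(p_pt || p_ft)).

    lam_min / lam_max are illustrative defaults, not values from the paper.
    """
    return lam_min + (lam_max - lam_min) * torch.sigmoid(gamma * js_divergence(p_pt, p_ft))

def merge_state_dicts(sd_pt, sd_ft, lam: float):
    """Interpolate checkpoints parameter-wise. We assume lam weights the
    fine-tuned model; check `evaluate_tcube` in t_cube.py for the real convention."""
    return {k: (1.0 - lam) * sd_pt[k] + lam * sd_ft[k] for k in sd_pt}

# Batch-wise usage sketch: one lambda per test batch (e.g., the mean over samples),
# so the two checkpoints are merged once per batch instead of once per sample:
# lam = adaptive_lambda(p_pt, p_ft).mean().item()
# merged = merge_state_dicts(model_pt.state_dict(), model_ft.state_dict(), lam)
```

Computing a single λ per batch, as in the commented usage above, is what yields the batch-wise efficiency claimed earlier: the interpolation of the two checkpoints happens once per batch rather than once per sample.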
---

## Folder Structure

See our [HuggingFace page](https://huggingface.co/razaimam45/TCube_Merging) for expert models and evaluation datasets.

```
T3/
├── clip/            # CLIP model adaptations
├── data/            # Data utilities
├── utils/           # Helper functions
├── models/          # Put your fine-tuned models HERE
├── dataset/         # Put your MediMeta/MedMNIST-C eval data HERE
├── baselines.py     # Comparison methods
├── t_cube.py        # Core T³ implementation
├── BetaMixture.py   # Auxiliary models
└── README.md        # This document
```

---

## Reproducing Results

To reproduce the results from the paper, run the `t_cube.py` script. It handles the evaluation of T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in `baselines.py`.

To navigate `t_cube.py`:
- See the `compute_samplewise_tcube_weights` and `compute_samplewise_tcube_weights_MI` functions for entropy-based (DaWin baseline) and our mutual-information-based merging, respectively.
- See the `evaluate_on_test_set` function for how datasets and severity levels are processed.
- See the `evaluate_tcube` function for the merging and evaluation logic.

---

## Pretrained Weights

We provide pretrained weights for the following models:

1. **Generalist CLIP**: A pretrained model for general vision-language tasks.
2. **Expert CLIPs**: Four fine-tuned models, one per medical imaging domain:
   - Breast Imaging
   - Fundoscopy
   - Cell Microscopy
   - Retinal OCT

These weights are available in the model card at [https://huggingface.co/razaimam45/TCube_Merging](https://huggingface.co/razaimam45/TCube_Merging), under the `models/finetuned` subfolder.

---

## Datasets

We provide the `Breast Imaging` evaluation sets on our [HuggingFace page](https://huggingface.co/razaimam45/TCube_Merging); please download them from there.

To evaluate multiple modalities, pass the `--testset` argument, e.g. `--testset 'bloodmnist/breastmnist/'`. This evaluates the MedMNIST-C and MediMeta sets for each listed modality (four datasets in this example); see the example invocation at the end of this README.

If you need the datasets for all modalities, you can find them here:
* [MedMNIST datasets](https://zenodo.org/records/10519652) | In-domain _fine-tuning_ datasets
* [MediMeta datasets](https://zenodo.org/records/7884735) | OOD-B2N _evaluation_ datasets
* [MedMNIST-C datasets](https://github.com/francescodisalvo05/medmnistc-api) | OOD-corruption _evaluation_ datasets

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Citation

If you find this work useful, please cite the arXiv version below:

```
@misc{imam2025t3testtimemodelmerging,
      title={T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis},
      author={Raza Imam and Hu Wang and Dwarikanath Mahapatra and Mohammad Yaqub},
      year={2025},
      eprint={2510.27265},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27265},
}
```

## Contact

For questions or collaborations, contact [Raza Imam](mailto:raza.imam@mbzuai.ac.ae). Please feel free to raise an issue if you face errors in reproducing the results.
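---

As referenced in the [Datasets](#datasets) section, a typical evaluation run might look like the sketch below. Only `t_cube.py` and the `--testset` argument are confirmed by this README; treat anything else as an assumption and check the argument parser in `t_cube.py` for the actual interface.

```bash
# Hypothetical invocation: evaluates the MedMNIST-C and MediMeta sets
# for the blood and breast modalities (four datasets in total).
python t_cube.py --testset 'bloodmnist/breastmnist/'
```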