# T³: Test-Time Model Merging for Medical Vision-Language Models

[Raza Imam](https://razaimam45.github.io/), Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
*Mohamed bin Zayed University of Artificial Intelligence*

[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE) [![Paper](https://img.shields.io/badge/Paper-ArXiV-red)](https://arxiv.org/abs/2510.27265) [![Weights](https://img.shields.io/badge/Weights-HuggingFace-yellow)](https://huggingface.co/razaimam45/TCube_Merging)

This repository provides the official PyTorch implementation of our T³ medical model-merging paper:

![T³ Workflow](figures/method.png)
*Figure 1: Dynamic test-time merging workflow of T³.*

Official implementation of **T³: Test-Time Model Merging in Vision-Language Models for Zero-Shot Medical Imaging**, a method for adaptive fusion of pretrained and fine-tuned vision-language models at test time using Jensen–Shannon divergence.

---

## Key Features

- 🧠 **Mutual Information Guidance**: Uses Jensen–Shannon divergence to measure model consensus.
- ⚡ **Backpropagation-Free**: No gradient updates required during inference.
- 🏥 **Medical Modality Agnostic**: Validated consistently across four medical imaging domains.
- 🚀 **Batch-Wise Efficiency**: Reduces compute cost by 32× versus sample-wise merging.
- 📈 **SOTA Performance**: Outperforms 8+ baselines in accuracy and robustness.

---

## Table of Contents

- [Installation](#installation)
- [Method Overview](#method-overview)
- [Folder Structure](#folder-structure)
- [Reproducing Results](#reproducing-results)
- [Pretrained Weights](#pretrained-weights)
- [Datasets](#datasets)
- [Citation](#citation)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Razaimam45/TCube.git T3
   cd T3
   ```

2. Create the conda environment:

   ```bash
   conda create -n t3 python=3.9
   conda activate t3
   pip install -r requirements.txt
   ```

## Method Overview

### Adaptive Merging via Jensen–Shannon Divergence

The interpolation coefficient λ is computed dynamically for each sample using the following equation:

```math
\lambda(x) = \lambda_{\min} + (\lambda_{\max} - \lambda_{\min}) \cdot \sigma\bigl(\gamma \cdot \mathrm{JS}(p_{pt}(x) \,\Vert\, p_{ft}(x))\bigr)
```

Where:
- `JS` = Jensen–Shannon divergence between the pretrained and fine-tuned models' predictions.
- `σ` = sigmoid function for smooth scaling.
- `γ` = scaling factor (default = 0.5).
- `λ_min`, `λ_max` = lower and upper bounds of the interpolation coefficient.

### Visual Explanation of the Method

The following points justify the method and its effectiveness:

1. **Dynamic Weighting Based on Model Agreement**

   We propose using Jensen–Shannon (JS) divergence to measure mutual information between the pretrained (`p_pt`) and fine-tuned (`p_ft`) model predictions, offering a more robust gauge of joint confidence than entropy-based methods such as DaWin's entropy ratio:

   ```math
   R(x) = \frac{\mathcal{H}(p_{ft}(x))}{\mathcal{H}(p_{pt}(x)) + \mathcal{H}(p_{ft}(x))}
   ```

   JS divergence explicitly captures agreement vs. disagreement by comparing the full predictive distributions:

   ```math
   I(x) = \frac{1}{2} \Bigl(\mathrm{KL}\bigl(p_{pt}(x) \,\Vert\, \bar{p}(x)\bigr) + \mathrm{KL}\bigl(p_{ft}(x) \,\Vert\, \bar{p}(x)\bigr)\Bigr)
   ```

   where

   ```math
   \bar{p}(x) = \tfrac{1}{2}\bigl(p_{pt}(x) + p_{ft}(x)\bigr)
   ```

   This ensures:
   - $I(x) = 0$ when the models fully agree.
   - $I(x) > 0$ when confident predictions disagree.

   Empirically, $I(x)$ correlates positively with $R(x)$ but better distinguishes disagreements, validating its use for adaptive merging.

2. **Mutual Information vs. Entropy**

   ![MI vs Entropy](figures/mi_v_ent.png)
   *Figure 3: Relationship between mutual information and entropy for adaptive merging.*

3. **Performance Across Modalities**

   ![Performance Comparison](figures/results.png)
   *Figure 4: T³ achieves superior performance across multiple medical imaging modalities.*
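To make the merging rule concrete, below is a minimal PyTorch sketch of the computation above. It is illustrative, not the repository's implementation: the function names (`js_divergence`, `adaptive_lambda`, `merge_state_dicts`), the `λ_min`/`λ_max` defaults, and the convention that λ weights the fine-tuned checkpoint are our own assumptions; only γ = 0.5 comes from the equations above. See `t_cube.py` for the actual code.

```python
import torch

def js_divergence(p_pt: torch.Tensor, p_ft: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two batches of predictive distributions.

    p_pt, p_ft: (batch, num_classes) probabilities (rows sum to 1).
    Returns a (batch,) tensor that is 0 when the two models fully agree.
    """
    p_bar = 0.5 * (p_pt + p_ft)
    log_bar = p_bar.clamp_min(1e-12).log()  # clamp avoids log(0)
    kl_pt = (p_pt * (p_pt.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    kl_ft = (p_ft * (p_ft.clamp_min(1e-12).log() - log_bar)).sum(dim=-1)
    return 0.5 * (kl_pt + kl_ft)

def adaptive_lambda(p_pt, p_ft, lam_min=0.0, lam_max=1.0, gamma=0.5):
    """lambda(x) = lam_min + (lam_max - lam_min) * sigmoid(gamma * JS(p_pt || p_ft)).

    lam_min / lam_max are illustrative defaults, not values from the paper.
    """
    return lam_min + (lam_max - lam_min) * torch.sigmoid(gamma * js_divergence(p_pt, p_ft))

def merge_state_dicts(sd_pt, sd_ft, lam: float):
    """Interpolate checkpoints parameter-wise. We assume lam weights the
    fine-tuned model; check `evaluate_tcube` in t_cube.py for the real convention."""
    return {k: (1.0 - lam) * sd_pt[k] + lam * sd_ft[k] for k in sd_pt}

# Batch-wise usage sketch: one lambda per test batch (e.g., the mean over samples),
# so the two checkpoints are merged once per batch instead of once per sample:
# lam = adaptive_lambda(p_pt, p_ft).mean().item()
# merged = merge_state_dicts(model_pt.state_dict(), model_ft.state_dict(), lam)
```

Computing a single λ per batch, as in the commented usage above, is what yields the batch-wise efficiency claimed earlier: the interpolation of the two checkpoints happens once per batch rather than once per sample.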
---

## Folder Structure

See our [HuggingFace page](https://huggingface.co/razaimam45/TCube_Merging) for expert models and evaluation datasets.

```
T3/
├── clip/            # CLIP model adaptations
├── data/            # Data utilities
├── utils/           # Helper functions
├── models/          # Put your fine-tuned models HERE
├── dataset/         # Put your MediMeta/MedMNIST-C eval data HERE
├── baselines.py     # Comparison methods
├── t_cube.py        # Core T³ implementation
├── BetaMixture.py   # Auxiliary models
└── README.md        # This document
```

---

## Reproducing Results

To reproduce the results from the paper, run the `t_cube.py` script. It handles the evaluation of T³ and its baselines across multiple datasets and severity levels. Additional baselines are available in `baselines.py`.

To navigate `t_cube.py`:
- See the `compute_samplewise_tcube_weights` and `compute_samplewise_tcube_weights_MI` functions for entropy-based (DaWin baseline) and our mutual-information-based merging, respectively.
- See the `evaluate_on_test_set` function for how datasets and severity levels are processed.
- See the `evaluate_tcube` function for the merging and evaluation logic.

---

## Pretrained Weights

We provide pretrained weights for the following models:

1. **Generalist CLIP**: A pretrained model for general vision-language tasks.
2. **Expert CLIPs**: Four fine-tuned models, one per medical imaging domain:
   - Breast Imaging
   - Fundoscopy
   - Cell Microscopy
   - Retinal OCT

These weights are available in the model card at [https://huggingface.co/razaimam45/TCube_Merging](https://huggingface.co/razaimam45/TCube_Merging), under the `models/finetuned` subfolder.

---

## Datasets

We provide the `Breast Imaging` evaluation sets on our [HuggingFace page](https://huggingface.co/razaimam45/TCube_Merging); please download them from there.

To evaluate multiple modalities, pass the `--testset` argument, e.g. `--testset 'bloodmnist/breastmnist/'`. This evaluates the MedMNIST-C and MediMeta sets for each listed modality (four datasets in this example); see the example invocation at the end of this README.

If you need the datasets for all modalities, you can find them here:
* [MedMNIST datasets](https://zenodo.org/records/10519652) | In-domain _fine-tuning_ datasets
* [MediMeta datasets](https://zenodo.org/records/7884735) | OOD-B2N _evaluation_ datasets
* [MedMNIST-C datasets](https://github.com/francescodisalvo05/medmnistc-api) | OOD-corruption _evaluation_ datasets

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## Citation

If you find this work useful, please cite the arXiv version below:

```
@misc{imam2025t3testtimemodelmerging,
      title={T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis},
      author={Raza Imam and Hu Wang and Dwarikanath Mahapatra and Mohammad Yaqub},
      year={2025},
      eprint={2510.27265},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27265},
}
```

## Contact

For questions or collaborations, contact [Raza Imam](mailto:raza.imam@mbzuai.ac.ae). Please feel free to raise an issue if you face errors in reproducing the results.
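---

As referenced in the [Datasets](#datasets) section, a typical evaluation run might look like the sketch below. Only `t_cube.py` and the `--testset` argument are confirmed by this README; treat anything else as an assumption and check the argument parser in `t_cube.py` for the actual interface.

```bash
# Hypothetical invocation: evaluates the MedMNIST-C and MediMeta sets
# for the blood and breast modalities (four datasets in total).
python t_cube.py --testset 'bloodmnist/breastmnist/'
```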