# Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction
Official model release for "A Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction: Towards Interpretable and Reproducible Medical AI."
This model jointly encodes chest X-ray images and radiology reports to perform disease prediction, case retrieval, and multimodal explanation.
Developed as part of the final project at HCMUS (University of Science, VNU-HCM).
All code, configurations, and pretrained checkpoints are publicly available to ensure full reproducibility.
## Overview
This framework integrates visual features from a Swin Transformer and textual embeddings from ClinicalBERT into a shared semantic space.
It supports both classification and retrieval tasks, providing interpretability through attention and gradient-based attribution.
## Model Architecture
- Image Encoder: Swin Transformer (fine-tuned from ImageNet pretrained weights)
- Text Encoder: ClinicalBERT (domain-specific language model for radiology)
- Fusion Module: Cross-modal attention with hybrid feed-forward layers
- Losses: BCE + Focal Loss for multi-label classification
- Explainability: Attention maps + Integrated Gradients (via Captum)
Embeddings from both modalities are projected into a unified latent space, enabling retrieval and interpretable cross-modal reasoning.
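For orientation, the sketch below shows how such a dual-encoder could be wired together. It is a minimal illustration, not the released implementation: the `swin_base_patch4_window7_224` backbone, the `emilyalsentzer/Bio_ClinicalBERT` checkpoint, the 512-d latent size, and the focal-loss defaults are all assumptions; the exact settings live in `config/`.

```python
# Minimal sketch of the dual-encoder idea above; names and sizes are assumptions.
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class DualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Image encoder: Swin Transformer with ImageNet weights (classifier head removed).
        self.image_encoder = timm.create_model(
            "swin_base_patch4_window7_224", pretrained=True, num_classes=0
        )
        # Text encoder: a ClinicalBERT checkpoint from the Hugging Face Hub.
        self.text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Projection heads map both modalities into the shared latent space.
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, pixels):
        # L2-normalized image embedding; cosine similarity becomes a dot product.
        return F.normalize(self.image_proj(self.image_encoder(pixels)), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the report embedding
        return F.normalize(self.text_proj(cls), dim=-1)

def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-label focal loss on top of BCE-with-logits (alpha/gamma are illustrative)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true label
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()
```

In the full model, the fusion module would then apply cross-modal attention (e.g. `nn.MultiheadAttention`) between the projected representations before the classification head.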
## Dataset
- Source: NIH Open-i Chest X-ray Dataset
- Inputs: DICOM chest X-rays and associated XML radiology reports
- Labels: MeSH-derived disease categories (22 multi-label classes)
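Open-i distributes each report as an XML file; a minimal parsing sketch is below. The `major` (MeSH major topic) and `AbstractText` tag names match the Open-i distribution as we understand it, but verify them against your download.

```python
# Sketch: pull MeSH labels and report sections out of an Open-i XML report.
import xml.etree.ElementTree as ET

def parse_report(path):
    root = ET.parse(path).getroot()
    # MeSH major topics are the source of the disease labels.
    mesh_terms = [m.text for m in root.iter("major")]
    # Report sections, keyed by their Label attribute (FINDINGS, IMPRESSION, ...).
    sections = {a.get("Label"): (a.text or "") for a in root.iter("AbstractText")}
    return mesh_terms, sections
```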
## Intended Uses
- Clinical education – search for radiographically and semantically similar cases
- Research baseline – benchmark for multimodal retrieval and explainable AI
- Explainability exploration – visualize model reasoning on medical data
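In practice, case retrieval is nearest-neighbor search over the precomputed embeddings. A minimal sketch, assuming L2-normalized embeddings saved as NumPy arrays (the file names here are hypothetical):

```python
import numpy as np

# Hypothetical files: precomputed, L2-normalized case embeddings and their IDs.
db = np.load("embeddings/train_embeddings.npy")   # shape (N, 512)
ids = np.load("embeddings/train_ids.npy")         # shape (N,)

def top_k(query, k=5):
    """For normalized vectors, cosine similarity is a plain dot product."""
    scores = db @ query                            # shape (N,)
    order = np.argsort(-scores)[:k]                # indices of the k best scores
    return list(zip(ids[order], scores[order]))
```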
## Performance Summary
### Classification Results
| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|---|---|---|
| F1-score | 0.7121 | 0.7114 |
| AUROC | 0.9499 | 0.9507 |
| Average Precision (AP) | 0.7708 | 0.7623 |
The model achieves strong diagnostic accuracy, especially for common findings such as COPD, cardiomegaly, and musculoskeletal disorders.
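The macro-averaged scores follow the standard scikit-learn definitions and can be recomputed as below (a sketch; the 0.5 decision threshold is an assumption):

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def macro_metrics(y_true, y_score, threshold=0.5):
    """y_true: (N, 22) binary label matrix; y_score: (N, 22) sigmoid outputs."""
    y_pred = (y_score >= threshold).astype(int)  # illustrative 0.5 threshold
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_auroc": roc_auc_score(y_true, y_score, average="macro"),
        "macro_ap": average_precision_score(y_true, y_score, average="macro"),
    }
```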
### Retrieval Results
| Protocol | P@5 | mAP | MRR | nDCG@5 |
|---|---|---|---|---|
| Generalization (test→test) | 0.7463 | 0.0068 | 0.8480 | 0.9380 |
| Historical (test→train) | 0.9173 | 0.0014 | 0.8810 | 0.9503 |
The model retrieves clinically coherent and diverse cases, maintaining high ranking precision under both the generalization (test→test) and historical (test→train) protocols.
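For reference, the ranking metrics above can be computed from per-query binary relevance lists as in this sketch (the paper's exact relevance criterion, e.g. shared MeSH labels, is assumed):

```python
import numpy as np

def precision_at_k(rels, k=5):
    """rels: binary relevance (1/0) of retrieved cases, best rank first."""
    return float(np.asarray(rels, dtype=float)[:k].mean())

def mrr(rels):
    """Reciprocal rank of the first relevant hit (0 if none)."""
    hits = np.flatnonzero(rels)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

def ndcg_at_k(rels, k=5):
    """Binary-relevance nDCG@k with log2 discounting."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    dcg = (rels / discounts).sum()
    idcg = (np.sort(rels)[::-1] / discounts).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```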
### Explainability Evaluation
| Metric | Value |
|---|---|
| Pearson correlation (ρ) | 0.9163 |
| IoU₀.₀₅ overlap | 0.58 |
Attention-based and gradient-based attributions highlight similar clinical regions, supporting the consistency and reliability of the model's explanations.
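A minimal sketch of this agreement check, assuming Captum's `IntegratedGradients`, an attention map already extracted from the fusion module and resized to the image grid, and that IoU₀.₀₅ binarizes each min-max-normalized map at 0.05:

```python
import numpy as np
from captum.attr import IntegratedGradients
from scipy.stats import pearsonr

def attribution_agreement(model, image, class_idx, attention_map, thr=0.05):
    """Compare IG attributions with a fusion attention map of the same spatial size."""
    ig = IntegratedGradients(model)
    attr = ig.attribute(image, target=class_idx)          # same shape as image
    attr_map = attr.abs().sum(dim=1).squeeze().detach().cpu().numpy()

    def norm(m):                                          # min-max normalize to [0, 1]
        m = m - m.min()
        return m / (m.max() + 1e-8)

    a, b = norm(attr_map), norm(np.asarray(attention_map, dtype=float))
    rho, _ = pearsonr(a.ravel(), b.ravel())               # Pearson agreement
    inter = np.logical_and(a > thr, b > thr).sum()        # IoU at the 0.05 threshold
    union = np.logical_or(a > thr, b > thr).sum()
    return rho, inter / union
```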
## Limitations & Risks
- Trained solely on Open-i; performance may vary across institutions.
- Explanations are for educational and research purposes only.
- Not certified for real-world clinical diagnostics.
## Reproducibility
- Code: GitHub Repository
- Pretrained Models & Embeddings: Hugging Face Model Hub
- Dataset: NIH Open-i
- Configuration Files: included under `config/` for exact experiment replication
## Acknowledgments
- NIH Open-i Dataset
- Disease Ontology (DOID)
- RadLex Ontology
- Swin Transformer (Timm), ClinicalBERT, SciSpaCy, Captum
MIT License © 2025 Phu Do Duc (HCMUS)