Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction

Official model release for "A Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction: Towards Interpretable and Reproducible Medical AI."
This model jointly encodes chest X-ray images and radiology reports to perform disease prediction, case retrieval, and multimodal explanation.

Developed as part of the final project at HCMUS (University of Science, VNU-HCM).
All code, configurations, and pretrained checkpoints are publicly available to ensure full reproducibility.


Overview

This framework integrates visual features from a Swin Transformer and textual embeddings from ClinicalBERT into a shared semantic space.
It supports both classification and retrieval tasks, providing interpretability through attention and gradient-based attribution.


Model Architecture

  • Image Encoder: Swin Transformer (fine-tuned from ImageNet pretrained weights)
  • Text Encoder: ClinicalBERT (domain-specific language model pretrained on clinical text)
  • Fusion Module: Cross-modal attention with hybrid feed-forward layers
  • Losses: BCE + Focal Loss for multi-label classification
  • Explainability: Attention maps + Integrated Gradients (via Captum)
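
The card lists BCE and focal loss but does not say how the two terms are combined. The following is a minimal sketch for a single sample, assuming a plain weighted sum. The function name, `gamma`, and `focal_weight` are illustrative assumptions, not the released training code.

```python
import math

def bce_focal_loss(probs, targets, gamma=2.0, focal_weight=1.0, eps=1e-7):
    """Mean of (BCE + focal_weight * focal) over the labels of one sample.

    probs:   predicted per-class probabilities (after a sigmoid)
    targets: 0/1 ground-truth labels, one per class
    gamma and focal_weight are assumed values, not taken from the model card.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)      # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p       # probability assigned to the true label
        bce = -math.log(p_t)                 # binary cross-entropy term
        focal = (1.0 - p_t) ** gamma * bce   # down-weights well-classified labels
        total += bce + focal_weight * focal
    return total / len(probs)

good = bce_focal_loss([0.9, 0.1], [1, 0])    # confident and correct
bad = bce_focal_loss([0.1, 0.9], [1, 0])     # confident and wrong
```

The focal factor `(1 - p_t) ** gamma` shrinks the contribution of labels the model already predicts well, which is the usual motivation for pairing it with BCE on imbalanced multi-label data.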

Embeddings from both modalities are projected into a unified latent space, enabling retrieval and interpretable cross-modal reasoning.
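
Once both modalities live in one latent space, retrieval reduces to a nearest-neighbour search over embeddings. The sketch below assumes cosine similarity as the ranking score, which the card does not confirm. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def retrieve(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = l2_normalize(query)
    scores = []
    for idx, emb in enumerate(gallery):
        g = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(q, g))
        scores.append((cosine, idx))
    scores.sort(key=lambda s: s[0], reverse=True)   # highest similarity first
    return [idx for _, idx in scores[:k]]

# Toy embeddings standing in for projected image/report features.
gallery = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
top = retrieve([0.9, 0.1], gallery, k=2)            # -> [0, 1]
```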


Dataset

  • Source: NIH Open-i Chest X-ray Dataset
  • Inputs: DICOM chest X-rays and associated XML radiology reports
  • Labels: MeSH-derived disease categories (22 multi-label classes)
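
To illustrate how report text and MeSH-derived labels might be pulled from an XML report, here is a small sketch using only the standard library. The tag names (`AbstractText`, `major`), the `Label` attribute, and the sample report itself are assumptions made for illustration. The actual Open-i schema and the mapping to the 22 classes should be checked against the dataset and the released code.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; not a real Open-i record.
SAMPLE_REPORT = """
<eCitation>
  <MeSH>
    <major>Cardiomegaly/mild</major>
    <major>Pulmonary Disease, Chronic Obstructive</major>
  </MeSH>
  <Abstract>
    <AbstractText Label="FINDINGS">Heart size is mildly enlarged.</AbstractText>
    <AbstractText Label="IMPRESSION">Mild cardiomegaly.</AbstractText>
  </Abstract>
</eCitation>
"""

def parse_report(xml_text):
    """Return (sections, labels) from one report under the assumed schema."""
    root = ET.fromstring(xml_text)
    # Map each labelled section (e.g. FINDINGS) to its text.
    sections = {
        node.get("Label"): (node.text or "").strip()
        for node in root.iter("AbstractText")
    }
    # Keep only the leading MeSH heading, dropping qualifiers after "/".
    labels = [
        (node.text or "").split("/")[0].strip()
        for node in root.iter("major")
    ]
    return sections, labels

sections, labels = parse_report(SAMPLE_REPORT)
```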

Intended Uses

  • Clinical education – search for radiographically and semantically similar cases
  • Research baseline – benchmark for multimodal retrieval and explainable AI
  • Explainability exploration – visualize model reasoning on medical data

Performance Summary

Classification Results

Metric                   Eval Set (Macro Avg)   Test Set (Macro Avg)
F1-score                 0.7121                 0.7114
AUROC                    0.9499                 0.9507
Average Precision (AP)   0.7708                 0.7623

Macro-averaged AUROC is about 0.95 on both splits, and performance is strongest on common findings such as COPD, Cardiomegaly, and Musculoskeletal disorders.
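
For readers unfamiliar with macro averaging, the sketch below computes a macro F1 score for multi-label predictions: F1 is computed per class and then averaged with equal weight. The function name and the toy labels are hypothetical, and the released evaluation code may differ (for example in how probabilities are thresholded).

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for multi-label data.

    y_true, y_pred: lists of per-sample 0/1 label vectors of equal length.
    """
    n_classes = len(y_true[0])
    f1_scores = []
    for c in range(n_classes):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # A class with no positives and no predictions scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_classes

y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(y_true, y_pred)   # class 0: 2/3, class 1: 1.0
```

Because every class counts equally, rare classes pull the macro average down more than they would a micro average.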


Retrieval Results

Protocol                     P@5      mAP      MRR     nDCG@5
Generalization (test→test)   0.7463   0.0068   0.848   0.938
Historical (test→train)      0.9173   0.0014   0.881   0.9503

The model retrieves clinically coherent and diverse cases: P@5 is 0.7463 when test queries search the test set and 0.9173 when they search the training set, and nDCG@5 is at least 0.938 under both protocols.
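
The ranking metrics in the table can be sketched for a single query as follows. This assumes binary relevance (for instance, a retrieved case counts as relevant if it shares a label with the query) and computes the ideal ordering for nDCG from the retrieved list only. Both choices are assumptions, since the card does not define them.

```python
import math

def precision_at_k(rel, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rel[:k]) / k

def reciprocal_rank(rel):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, r in enumerate(rel, start=1):
        if r:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(rel, k):
    """DCG of the top-k results divided by the DCG of the best reordering."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel[:k]))
    ideal = sorted(rel, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# 1 = relevant, 0 = not relevant, in ranked order for one query.
rel = [0, 1, 1, 0, 1]
p5 = precision_at_k(rel, 5)
rr = reciprocal_rank(rel)
ndcg5 = ndcg_at_k(rel, 5)
```

MRR is the mean of `reciprocal_rank` over all queries; P@5 and nDCG@5 are averaged across queries in the same way.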


Explainability Evaluation

Metric                    Value
Pearson correlation (ρ)   0.9163
IoU₀.₀₅ overlap           0.58

Attention maps and gradient-based attributions highlight similar regions (ρ = 0.9163, IoU 0.58), which indicates that the two explanation methods largely agree.
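
One way to compare two attribution maps is sketched below, with both maps flattened to lists of values in [0, 1]. It assumes the subscript in IoU₀.₀₅ denotes a threshold of 0.05 applied to normalized maps. That reading, the function names, and the toy values are assumptions, not the documented protocol.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally sized, flattened maps."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0

def iou_at_threshold(a, b, thr=0.05):
    """Intersection over union of the regions where each map is >= thr."""
    mask_a = [x >= thr for x in a]
    mask_b = [y >= thr for y in b]
    inter = sum(1 for m, n in zip(mask_a, mask_b) if m and n)
    union = sum(1 for m, n in zip(mask_a, mask_b) if m or n)
    return inter / union if union else 0.0

# Toy maps standing in for an attention map and an Integrated Gradients map.
attention = [0.0, 0.1, 0.4, 0.5]
ig = [0.01, 0.0, 0.5, 0.49]
rho = pearson(attention, ig)
overlap = iou_at_threshold(attention, ig)
```

Correlation measures agreement in relative intensity across all pixels, while IoU only compares which pixels clear the threshold, so the two numbers can diverge.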


Limitations & Risks

  • Trained solely on Open-i; performance may vary across institutions.
  • Explanations are for educational and research purposes only.
  • Not certified for real-world clinical diagnostics.

Reproducibility


Acknowledgments


MIT License © 2025 Phu Do Duc (HCMUS)
