# Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction
Official model release for "A Unified Multimodal Framework for Chest X-Ray Retrieval and Disease Prediction: Towards Interpretable and Reproducible Medical AI."
This model jointly encodes chest X-ray images and radiology reports to perform disease prediction, case retrieval, and multimodal explanation.
Developed as part of the final project at HCMUS (University of Science, VNU-HCM).
All code, configurations, and pretrained checkpoints are publicly available to ensure full reproducibility.
## Overview
This framework integrates visual features from a Swin Transformer and textual embeddings from ClinicalBERT into a shared semantic space.
It supports both classification and retrieval tasks, providing interpretability through attention and gradient-based attribution.
## Model Architecture
- Image Encoder: Swin Transformer (fine-tuned from ImageNet pretrained weights)
- Text Encoder: ClinicalBERT (domain-specific language model for radiology)
- Fusion Module: Cross-modal attention with hybrid feed-forward layers
- Losses: BCE + Focal Loss for multi-label classification
- Explainability: Attention maps + Integrated Gradients (via Captum)
Embeddings from both modalities are projected into a unified latent space, enabling retrieval and interpretable cross-modal reasoning.
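For orientation, the sketch below shows how such a dual-encoder could be wired together. It is a minimal illustration, not the released implementation: the `swin_base_patch4_window7_224` backbone, the `emilyalsentzer/Bio_ClinicalBERT` checkpoint, the 512-d latent size, and the focal-loss defaults are all assumptions; the exact settings live in `config/`.

```python
# Minimal sketch of the dual-encoder idea above; names and sizes are assumptions.
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class DualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Image encoder: Swin Transformer with ImageNet weights (classifier head removed).
        self.image_encoder = timm.create_model(
            "swin_base_patch4_window7_224", pretrained=True, num_classes=0
        )
        # Text encoder: a ClinicalBERT checkpoint from the Hugging Face Hub.
        self.text_encoder = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Projection heads map both modalities into the shared latent space.
        self.image_proj = nn.Linear(self.image_encoder.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

    def encode_image(self, pixels):
        # L2-normalized image embedding; cosine similarity becomes a dot product.
        return F.normalize(self.image_proj(self.image_encoder(pixels)), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token as the report embedding
        return F.normalize(self.text_proj(cls), dim=-1)

def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-label focal loss on top of BCE-with-logits (alpha/gamma are illustrative)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # model's probability for the true label
    return (alpha * (1.0 - p_t) ** gamma * bce).mean()
```

In the full model, the fusion module would then apply cross-modal attention (e.g. `nn.MultiheadAttention`) between the projected representations before the classification head.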
## Dataset
- Source: NIH Open-i Chest X-ray Dataset
- Inputs: DICOM chest X-rays and associated XML radiology reports
- Labels: MeSH-derived disease categories (22 multi-label classes)
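Open-i distributes each report as an XML file; a minimal parsing sketch is below. The `major` (MeSH major topic) and `AbstractText` tag names match the Open-i distribution as we understand it, but verify them against your download.

```python
# Sketch: pull MeSH labels and report sections out of an Open-i XML report.
import xml.etree.ElementTree as ET

def parse_report(path):
    root = ET.parse(path).getroot()
    # MeSH major topics are the source of the disease labels.
    mesh_terms = [m.text for m in root.iter("major")]
    # Report sections, keyed by their Label attribute (FINDINGS, IMPRESSION, ...).
    sections = {a.get("Label"): (a.text or "") for a in root.iter("AbstractText")}
    return mesh_terms, sections
```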
## Intended Uses
- Clinical education – search for radiographically and semantically similar cases
- Research baseline – benchmark for multimodal retrieval and explainable AI
- Explainability exploration – visualize model reasoning on medical data
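In practice, case retrieval is nearest-neighbor search over the precomputed embeddings. A minimal sketch, assuming L2-normalized embeddings saved as NumPy arrays (the file names here are hypothetical):

```python
import numpy as np

# Hypothetical files: precomputed, L2-normalized case embeddings and their IDs.
db = np.load("embeddings/train_embeddings.npy")   # shape (N, 512)
ids = np.load("embeddings/train_ids.npy")         # shape (N,)

def top_k(query, k=5):
    """For normalized vectors, cosine similarity is a plain dot product."""
    scores = db @ query                            # shape (N,)
    order = np.argsort(-scores)[:k]                # indices of the k best scores
    return list(zip(ids[order], scores[order]))
```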
## Performance Summary
### Classification Results
| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|---|---|---|
| F1-score | 0.7121 | 0.7114 |
| AUROC | 0.9499 | 0.9507 |
| Average Precision (AP) | 0.7708 | 0.7623 |
The model achieves strong diagnostic accuracy, especially for common findings such as COPD, cardiomegaly, and musculoskeletal disorders.
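The macro-averaged scores follow the standard scikit-learn definitions and can be recomputed as below (a sketch; the 0.5 decision threshold is an assumption):

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

def macro_metrics(y_true, y_score, threshold=0.5):
    """y_true: (N, 22) binary label matrix; y_score: (N, 22) sigmoid outputs."""
    y_pred = (y_score >= threshold).astype(int)  # illustrative 0.5 threshold
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_auroc": roc_auc_score(y_true, y_score, average="macro"),
        "macro_ap": average_precision_score(y_true, y_score, average="macro"),
    }
```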
### Retrieval Results
| Protocol | P@5 | mAP | MRR | nDCG@5 |
|---|---|---|---|---|
| Generalization (test→test) | 0.7463 | 0.0068 | 0.8480 | 0.9380 |
| Historical (test→train) | 0.9173 | 0.0014 | 0.8810 | 0.9503 |
The model retrieves clinically coherent and diverse cases, maintaining high ranking precision under both the generalization (test→test) and historical (test→train) protocols.
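For reference, the ranking metrics above can be computed from per-query binary relevance lists as in this sketch (the paper's exact relevance criterion, e.g. shared MeSH labels, is assumed):

```python
import numpy as np

def precision_at_k(rels, k=5):
    """rels: binary relevance (1/0) of retrieved cases, best rank first."""
    return float(np.asarray(rels, dtype=float)[:k].mean())

def mrr(rels):
    """Reciprocal rank of the first relevant hit (0 if none)."""
    hits = np.flatnonzero(rels)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

def ndcg_at_k(rels, k=5):
    """Binary-relevance nDCG@k with log2 discounting."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    dcg = (rels / discounts).sum()
    idcg = (np.sort(rels)[::-1] / discounts).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0
```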
### Explainability Evaluation
| Metric | Value |
|---|---|
| Pearson correlation (ρ) | 0.9163 |
| IoU₀.₀₅ overlap | 0.58 |
Attention-based and gradient-based attributions highlight similar clinical regions, supporting the consistency and reliability of the model's explanations.
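A minimal sketch of this agreement check, assuming Captum's `IntegratedGradients`, an attention map already extracted from the fusion module and resized to the image grid, and that IoU₀.₀₅ binarizes each min-max-normalized map at 0.05:

```python
import numpy as np
from captum.attr import IntegratedGradients
from scipy.stats import pearsonr

def attribution_agreement(model, image, class_idx, attention_map, thr=0.05):
    """Compare IG attributions with a fusion attention map of the same spatial size."""
    ig = IntegratedGradients(model)
    attr = ig.attribute(image, target=class_idx)          # same shape as image
    attr_map = attr.abs().sum(dim=1).squeeze().detach().cpu().numpy()

    def norm(m):                                          # min-max normalize to [0, 1]
        m = m - m.min()
        return m / (m.max() + 1e-8)

    a, b = norm(attr_map), norm(np.asarray(attention_map, dtype=float))
    rho, _ = pearsonr(a.ravel(), b.ravel())               # Pearson agreement
    inter = np.logical_and(a > thr, b > thr).sum()        # IoU at the 0.05 threshold
    union = np.logical_or(a > thr, b > thr).sum()
    return rho, inter / union
```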
## Limitations & Risks
- Trained solely on Open-i; performance may vary across institutions.
- Explanations are for educational and research purposes only.
- Not certified for real-world clinical diagnostics.
## Reproducibility
- Code: GitHub Repository
- Pretrained Models & Embeddings: Hugging Face Model Hub
- Dataset: NIH Open-i
- Configuration Files: included under `config/` for exact experiment replication
## Acknowledgments
- NIH Open-i Dataset
- Disease Ontology (DOID)
- RadLex Ontology
- Swin Transformer (Timm), ClinicalBERT, SciSpaCy, Captum
MIT License © 2025 Phu Do Duc (HCMUS)