---
license: mit
tags:
- chest-xray
- medical
- multimodal
- retrieval
- explanation
- clinicalbert
- swin-transformer
- deep-learning
- image-text
datasets:
- openi
language:
- en
---

# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + MedCLIP/Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention, Grad-CAM, and Integrated Gradients (IG)

> Developed as a final project at HCMUS.

---

## Model Architecture

- **Image Encoder:** Swin Transformer (pretrained) / MedCLIP (pretrained)
- **Text Encoder (Base):** ClinicalBERT
- **Fusion Module:** Cross-modal attention with hybrid FFN layers
- **Losses:** BCE + Focal Loss for multi-label classification

Embeddings from both modalities are projected into a **shared joint space**, enabling retrieval and explanation.
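
A minimal sketch of the fusion and loss components in PyTorch, assuming hypothetical names (`CrossModalFusion`, `focal_bce_loss`, `embed_dim`, `joint_dim`) rather than the repository's actual modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Sketch: image tokens attend to report tokens, a FFN refines the
    result, and the pooled output is projected into the shared joint space."""

    def __init__(self, embed_dim=768, num_heads=8, joint_dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.proj = nn.Linear(embed_dim, joint_dim)  # shared joint space

    def forward(self, img_tokens, txt_tokens):
        # Image tokens act as queries; report tokens provide keys/values.
        attended, _ = self.cross_attn(img_tokens, txt_tokens, txt_tokens)
        x = self.norm1(img_tokens + attended)
        x = self.norm2(x + self.ffn(x))
        # Mean-pool the token sequence and project into the joint space.
        return self.proj(x.mean(dim=1))

def focal_bce_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Common BCE + focal formulation for multi-label heads (assumed defaults)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```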

---

## Training Data

- **Dataset:** [NIH Open-i Chest X-ray Dataset](https://openi.nlm.nih.gov/)
- **Input Modalities:**
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- **Labels:** MeSH-derived disease categories (multi-label)
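
A minimal loading sketch for the two modalities, assuming the tag layout of the Open-i XML reports (`AbstractText` sections with `Label` attributes, MeSH terms under `major` elements); adjust paths and tag names to the actual files:

```python
import xml.etree.ElementTree as ET

import numpy as np
import pydicom

def load_dicom(path):
    """Read a chest X-ray DICOM and min-max scale pixels to [0, 1]."""
    img = pydicom.dcmread(path).pixel_array.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)

def load_report(path):
    """Extract FINDINGS/IMPRESSION text and MeSH terms from an XML report."""
    root = ET.parse(path).getroot()
    sections = {n.get("Label", "").upper(): (n.text or "")
                for n in root.iter("AbstractText")}
    mesh = [n.text for n in root.iter("major") if n.text]
    return {"findings": sections.get("FINDINGS", ""),
            "impression": sections.get("IMPRESSION", ""),
            "labels": mesh}
```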

---

## Intended Uses

* Clinical Education: Case similarity search for radiology students
* Research: Baseline for multimodal medical retrieval
* Explainability: Visualize disease evidence in both image and text

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels, with the best results obtained by the **MedCLIP** configuration. Per-label metrics include **precision**, **recall**, **F1-score**, and **AUROC**; the macro-averaged **F1-score**, **AUROC**, and **average precision (AP)** are summarized below.

| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|--------|----------------------|----------------------|
| F1-score | **0.7967** | **0.8974** |
| AUROC | **0.9664** | **0.7372** |
| AP | **0.9138** | **0.7648** |

*The model achieves strong label-level performance, particularly on common findings such as COPD, cardiomegaly, and degenerative musculoskeletal disease. The MedCLIP configuration substantially improves overall performance.*
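
The macro-averaged numbers above can be reproduced with scikit-learn; a minimal sketch, with synthetic placeholder arrays standing in for the real predictions:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Placeholder data: (N, 22) binary labels and predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 22))
y_prob = rng.random((100, 22))

y_pred = (y_prob >= 0.5).astype(int)  # single global threshold for illustration
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Macro AUROC:", roc_auc_score(y_true, y_prob, average="macro"))
print("Macro AP:   ", average_precision_score(y_true, y_prob, average="macro"))
```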

---

### Retrieval Performance

Retrieval was evaluated under two protocols: **generalization**, where test queries are matched against the test gallery, and **historical**, where test queries are matched against the training gallery. Both achieve high precision and sub-millisecond query times.

| Protocol | P@5 | mAP | MRR | DCG@5 | Avg Time (ms) |
|----------|-----|-----|-----|-------|---------------|
| Generalization (test → test) | 0.7463 | 0.0068 | 0.848 | 0.9381 | 0.77 |
| Historical (test → train) | 0.9173 | 0.0010 | 0.881 | 0.9503 | 0.58 |
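
Both protocols reduce to nearest-neighbour search in the joint embedding space; a minimal sketch of retrieval plus P@5 scoring (function names are hypothetical):

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_emb, k=5):
    """Rank gallery cases by cosine similarity in the shared joint space."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # (num_queries, gallery_size)
    # For the test -> test protocol, drop the query itself before ranking.
    return np.argsort(-sims, axis=1)[:, :k]

def precision_at_k(ranked, query_labels, gallery_labels, k=5):
    """A retrieved case is relevant if it shares at least one disease label.

    query_labels / gallery_labels: boolean (num_cases, num_labels) arrays.
    """
    per_query = [
        np.mean([(query_labels[i] & gallery_labels[j]).any() for j in row[:k]])
        for i, row in enumerate(ranked)
    ]
    return float(np.mean(per_query))
```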

---

### Explainability Performance

Attribution metrics indicate high visual fidelity, suggesting that the model's attention aligns with clinically relevant image regions.

| Metric | Value | Interpretation |
|--------|-------|----------------|
| Pearson correlation ($\rho$) | **0.9163** | High linear agreement across attribution maps |
| [email protected] | **0.5762** | Moderate overlap of the top 5% most salient regions |
| [email protected] | **0.2519** | Moderate overlap across the broader top 20% of salient regions |

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*
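
The IoU@τ metric keeps the top τ fraction of pixels in each attribution map (e.g., Grad-CAM vs. Integrated Gradients) and measures their overlap; a minimal sketch of both agreement metrics:

```python
import numpy as np

def pearson_rho(map_a, map_b):
    """Linear agreement between two flattened attribution maps."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])

def iou_at_tau(map_a, map_b, tau=0.05):
    """Overlap of the top-tau fraction of most salient pixels."""
    k = max(1, int(tau * map_a.size))
    top_a = np.zeros(map_a.size, dtype=bool)
    top_a[np.argsort(-map_a.ravel())[:k]] = True
    top_b = np.zeros(map_b.size, dtype=bool)
    top_b[np.argsort(-map_b.ravel())[:k]] = True
    return float((top_a & top_b).sum() / (top_a | top_b).sum())
```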

---

### Notes

- Retrieval and diversity metrics highlight the model's ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

* Trained on a single public dataset (Open-i); may not generalize to other hospitals
* Explanations are not clinically validated
* Not for diagnostic use in real-world settings

---

## Acknowledgments

* [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
* [DOID](http://purl.obolibrary.org/obo/doid.obo)
* [RadLex](https://bioportal.bioontology.org/ontologies/RADLEX)
* Swin Transformer (timm)
* ClinicalBERT (Emily Alsentzer et al.)
* MedCLIP (Zifeng Wang et al., EMNLP 2022)
* Captum (for IG explanations)
* Grad-CAM

## Code

Source code: [GitHub](https://github.com/ppddddpp/unified-multimodal-chestxray)