---
license: mit
tags:
- chest-xray
- medical
- multimodal
- retrieval
- explanation
- clinicalbert
- swin-transformer
- deep-learning
- image-text
datasets:
- openi
language:
- en
---
# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + MedCLIP/Swin)
This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:
- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention, Grad-CAM, and Integrated Gradients (IG)
Developed as a final project at HCMUS.
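For the explanation step, the hedged sketch below shows how Integrated Gradients attributions could be computed with Captum. The `model`, input tensor, and `target` index are stand-ins for this snippet only, not the classes released with this project.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in classifier so the snippet runs on its own; the real model is the
# multimodal fusion network described in the architecture section.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 22))

ig = IntegratedGradients(model)
pixels = torch.randn(1, 3, 224, 224, requires_grad=True)   # placeholder for a preprocessed DICOM
attributions = ig.attribute(pixels, target=3, n_steps=50)  # target = index of one disease label
saliency = attributions.abs().sum(dim=1)                   # collapse channels into a 2-D saliency map
print(saliency.shape)                                       # torch.Size([1, 224, 224])
```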
## Model Architecture
- Image Encoder: Swin Transformer (pretrained) / MedCLIP (pretrained)
- Text Encoder (Base): ClinicalBERT
- Fusion Module: Cross-modal attention with hybrid FFN layers
- Losses: BCE + Focal Loss for multi-label classification
Embeddings from both modalities are projected into a shared joint space, enabling retrieval and explanation.
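A minimal PyTorch sketch of this fusion idea is shown below; the module name (`JointFusion`), dimensions, and the exact focal-loss formulation are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, proj_dim=256, n_labels=22):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)    # project Swin/MedCLIP image features
        self.txt_proj = nn.Linear(txt_dim, proj_dim)    # project ClinicalBERT report features
        self.cross_attn = nn.MultiheadAttention(proj_dim, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(proj_dim, proj_dim * 2), nn.GELU(),
                                 nn.Linear(proj_dim * 2, proj_dim))
        self.classifier = nn.Linear(proj_dim, n_labels)

    def forward(self, img_feats, txt_feats):
        q = self.img_proj(img_feats)                    # (B, Ni, D) image tokens as queries
        kv = self.txt_proj(txt_feats)                   # (B, Nt, D) report tokens as keys/values
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.ffn(fused).mean(dim=1)             # pooled joint embedding, also used for retrieval
        return fused, self.classifier(fused)

def focal_bce(logits, targets, gamma=2.0):
    # Focal modulation of the element-wise BCE used for multi-label training.
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                               # probability assigned to the true label
    return ((1 - p_t) ** gamma * bce).mean()
```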
## Training Data
- Dataset: NIH Open-i Chest X-ray Dataset
- Input Modalities:
- Chest X-ray DICOMs
- Associated XML radiology reports
- Labels: MeSH-derived disease categories (multi-label)
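As a rough illustration (not the project's actual preprocessing code), the snippet below shows how one Open-i case could be turned into model inputs: reading the DICOM pixels, extracting report text and MeSH terms from the XML, and building a multi-hot label vector. The XML tag names and the `MESH_LABELS` list are assumptions.

```python
import numpy as np
import pydicom
import xml.etree.ElementTree as ET

MESH_LABELS = ["Cardiomegaly", "COPD", "Opacity"]       # illustrative subset of the 22 labels

def load_case(dicom_path, xml_path):
    # Read and min-max normalise the X-ray pixels.
    pixels = pydicom.dcmread(dicom_path).pixel_array.astype(np.float32)
    pixels = (pixels - pixels.min()) / (pixels.max() - pixels.min() + 1e-8)

    # Pull report text and MeSH major terms from the Open-i XML report.
    root = ET.parse(xml_path).getroot()
    report = " ".join(t.text or "" for t in root.iter("AbstractText"))
    mesh_terms = {m.text for m in root.iter("major") if m.text}

    # Multi-hot label vector from MeSH-derived categories.
    labels = np.array([any(label.lower() in term.lower() for term in mesh_terms)
                       for label in MESH_LABELS], dtype=np.float32)
    return pixels, report, labels
```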
## Intended Uses
- Clinical Education: Case similarity search for radiology students (see the retrieval sketch below)
- Research: Baseline for multimodal medical retrieval
- Explainability: Visualize disease evidence in both image and text
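A hedged sketch of the case-similarity search: cosine similarity over precomputed joint embeddings. `case_bank` and `query_vec` are illustrative placeholders rather than the repository's API.

```python
import torch
import torch.nn.functional as F

case_bank = F.normalize(torch.randn(5000, 256), dim=1)   # joint embeddings of indexed cases
query_vec = F.normalize(torch.randn(1, 256), dim=1)      # joint embedding of the query study

scores = query_vec @ case_bank.T                          # cosine similarity (unit vectors)
topk = torch.topk(scores, k=5, dim=1)
print(topk.indices.tolist(), topk.values.tolist())        # ids and similarities of the 5 nearest cases
```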
## Model Performance
### Classification
The model was evaluated on a held-out evaluation set and a separate test set across 22 disease labels. Per-label precision, recall, F1-score, and AUROC were computed; the table below reports macro-averaged F1-score, AUROC, and average precision (AP). The best results are obtained with the MedCLIP configuration.
| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|---|---|---|
| F1-score | 0.7967 | 0.8974 |
| AUROC | 0.9664 | 0.7372 |
| AP | 0.9138 | 0.7648 |
The model achieves strong label-level performance, particularly on common findings such as COPD, Cardiomegaly, and Musculoskeletal degenerative diseases. The MedCLIP configuration significantly improves overall performance.
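For reference, macro-averaged metrics of this kind can be computed with scikit-learn along the following lines; the synthetic `y_true` / `y_prob` arrays stand in for the real 22-label predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(500, 22))               # multi-hot ground truth
y_prob = rng.random(size=(500, 22))                       # predicted probabilities per label

f1 = f1_score(y_true, (y_prob > 0.5).astype(int), average="macro")
auroc = roc_auc_score(y_true, y_prob, average="macro")
ap = average_precision_score(y_true, y_prob, average="macro")
print(f"macro F1={f1:.4f}  AUROC={auroc:.4f}  AP={ap:.4f}")
```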
### Retrieval Performance
Retrieval was evaluated under two protocols: a generalization setting (querying the test set against itself) and a historical setting (querying the test set against the training set). Both show strong ranking performance with sub-millisecond average query times.
| Protocol | P@5 | mAP | MRR | DCG@5 | Avg Time (ms) |
|---|---|---|---|---|---|
| Generalization (test → test) | 0.7463 | 0.0068 | 0.848 | 0.9381 | 0.77 |
| Historical (test → train) | 0.9173 | 0.0010 | 0.881 | 0.9503 | 0.58 |
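The ranking metrics above follow standard definitions; a hedged single-query reference implementation is sketched below (the project's actual evaluation script may differ).

```python
import numpy as np

def precision_at_k(relevant, k=5):
    # Fraction of the top-k retrieved cases that are relevant.
    return float(np.mean(relevant[:k]))

def reciprocal_rank(relevant):
    # 1 / rank of the first relevant case (0 if none retrieved).
    hits = np.flatnonzero(relevant)
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

def dcg_at_k(relevant, k=5):
    # Discounted cumulative gain; rank 1 gets a discount of log2(2) = 1.
    gains = relevant[:k] / np.log2(np.arange(2, k + 2))
    return float(gains.sum())

relevant = np.array([1, 0, 1, 1, 0, 1])                    # toy relevance judgements for one query
print(precision_at_k(relevant), reciprocal_rank(relevant), dcg_at_k(relevant))
```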
### Explainability Performance
Attribution metrics indicate high visual fidelity, suggesting that the model's attention aligns with clinically relevant image regions.
| Metric | Value | Interpretation |
|---|---|---|
| Pearson correlation ($\rho$) | 0.9163 | High linear agreement across attribution maps |
| Overlap@0.05 | 0.5762 | Moderate overlap of the top 5% most salient regions |
| Overlap@0.20 | 0.2519 | Moderate overlap across the broader top 20% of salient regions |
The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.
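A sketch of how such agreement metrics between two attribution maps (e.g., Grad-CAM vs. Integrated Gradients) might be computed; the metric names, thresholds, and toy maps are assumptions rather than the released evaluation code.

```python
import numpy as np

def pearson(a, b):
    # Linear agreement between two flattened attribution maps.
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def topk_overlap(a, b, frac=0.05):
    # Intersection-over-union of the top-`frac` most salient pixels in each map.
    k = max(1, int(frac * a.size))
    top_a = set(np.argsort(a.ravel())[-k:])
    top_b = set(np.argsort(b.ravel())[-k:])
    return len(top_a & top_b) / len(top_a | top_b)

map_a = np.random.rand(224, 224)
map_b = 0.7 * map_a + 0.3 * np.random.rand(224, 224)        # correlated toy attribution maps
print(pearson(map_a, map_b), topk_overlap(map_a, map_b, 0.05), topk_overlap(map_a, map_b, 0.20))
```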
## Notes
- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.
## Limitations & Risks
- Trained on a single public dataset (Open-i); may not generalize to data from other hospitals or acquisition settings
- Explanations are not clinically validated
- Not intended for diagnostic use in real-world clinical settings
## Acknowledgments
- Swin Transformer (timm)
- ClinicalBERT (Emily Alsentzer et al.)
- MedCLIP (Zifeng Wang et al., EMNLP 2022)
- Captum (for IG explanations)
- Grad-CAM