# Science Keyword Classification model

We fine-tuned the INDUS-SDE model to classify scientific keywords from NASA's Common Metadata Repository (CMR). The project aims to improve the accessibility and organization of Earth observation metadata by predicting associated keywords in an Extreme Multi-Label Classification setting.

## Model Overview

  • Base Model: INDUS-SDE, fine-tuned for multi-label classification.
  • Loss Function: The model uses focal loss instead of traditional cross-entropy to address label imbalance by focusing on difficult-to-classify examples.
  • Dataset: NASA's CMR metadata, filtered to remove duplicates and irrelevant labels, resulting in 42,474 records and 3,240 labels. You can find the dataset here.
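Focal loss down-weights the contribution of already well-classified labels by a factor of (1 − p)^γ, so the gradient signal concentrates on rare, hard-to-predict keywords. A minimal NumPy sketch of the per-label (sigmoid) multi-label variant; the `alpha` balancing term and the exact formulation are illustrative assumptions, not the project's training code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-label focal loss over per-label sigmoid probabilities.

    logits, targets: arrays of shape (batch, num_labels); targets in {0, 1}.
    With gamma=0 and alpha=0.5 this reduces to (half of) binary cross-entropy.
    """
    p = sigmoid(logits)
    # p_t = probability the model assigns to the true outcome of each label
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma is near 0 for confident correct predictions,
    # so easy labels contribute little to the loss
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)
    return loss.mean()
```

Raising γ (the experiments below compare γ = 2 and γ = 4) suppresses easy examples more aggressively.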

## Key Features

  • Extreme Multi-Label Classification: Addresses classification with a vast number of potential labels (keywords) and imbalanced frequency.
  • Stratified Splitting: The dataset is split based on provider-id to maintain balanced representation across train, validation, and test sets.
  • Improved Performance: Focal loss with different focusing parameters (γ) was evaluated, showing significant improvements in weighted precision, recall, F1 score, and Jaccard similarity over cross-entropy loss and previous models.
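Conceptually, the provider-based split groups records by provider and carves each group up proportionally, so no provider is over-represented in any split. A small pure-Python sketch (the `provider_id` field name and the 80/10/10 fractions are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_split(records, key="provider_id", fracs=(0.8, 0.1, 0.1), seed=42):
    """Split records so each provider appears proportionally in
    train / validation / test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    train, val, test = [], [], []
    for recs in groups.values():
        rng.shuffle(recs)  # shuffle within each provider group
        n_train = int(len(recs) * fracs[0])
        n_val = int(len(recs) * fracs[1])
        train += recs[:n_train]
        val += recs[n_train:n_train + n_val]
        test += recs[n_train + n_val:]
    return train, val, test
```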

## Label Mapping During Inference

After obtaining predictions from the model, we can map the predicted label indices to their actual names using the `model.config.id2label` dictionary:

```python
# Example usage: map predicted label indices to keyword names
predicted_indices = [0, 2, 5]  # indices of the top-3 predictions
predicted_labels = [model.config.id2label[idx] for idx in predicted_indices]
print(predicted_labels)
```
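In practice, those indices come from ranking the model's per-label scores. A hedged sketch of that step (the function name and use of NumPy are illustrative; `id2label` is the same `model.config.id2label` mapping):

```python
import numpy as np

def top_k_labels(logits, id2label, k=3):
    """Return the k highest-scoring (keyword, probability) pairs."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))  # per-label sigmoid
    top = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(id2label[int(i)], float(probs[i])) for i in top]
```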

## Experiments

  1. Baseline (alpha-1.0.1): Used cross-entropy loss.
  2. Experiment 2 (alpha-1.1.1): Focal loss with γ = 4.
  3. Experiment 3 (alpha-1.1.2): Focal loss with γ = 2.
  4. Experiment 4 (alpha-1.2.1): Focal loss (γ = 2) with stratified splitting.
  5. Experiment 5 (INDUS-SDE-GKR): Focal loss (γ = 2) with stratified splitting and the INDUS-SDE base model.

## Results

The INDUS-SDE-GKR model outperformed all other configurations, including the previous best alpha-1.2.1. By leveraging domain-specific pre-training on the SDE dataset and a larger context window (1024 tokens), INDUS-SDE achieved a Mean Reciprocal Rank (MRR) of 0.791, compared to 0.782 for alpha-1.2.1 and 0.744 for the ModernBERT-SDE baseline.
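MRR averages the reciprocal rank of the first correct keyword over all records, so a model that always ranks a true keyword first scores 1.0. A small sketch of the metric itself (the project's actual evaluation code and interfaces are not shown here):

```python
def mean_reciprocal_rank(ranked_preds, gold_sets):
    """ranked_preds: per-record lists of predicted keywords, best first.
    gold_sets: per-record sets of true keywords."""
    reciprocal_ranks = []
    for preds, gold in zip(ranked_preds, gold_sets):
        # rank (1-based) of the first prediction that is a true keyword
        rank = next((i + 1 for i, p in enumerate(preds) if p in gold), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```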

[Figures: weighted metrics comparison and INDUS-SDE-GKR MRR comparison.]
Please find the accompanying [technical writeup here](https://github.com/NASA-IMPACT/science-keywords-classification/blob/develop/documents/Science_Keyword_Classification.pdf).
Model size: 0.2B parameters (F32, Safetensors format).