SajilAwale committed on
Commit 1b100e4 · verified · 1 Parent(s): b324e60

Updated readme

Files changed (1)
  1. README.md +40 -1
README.md CHANGED
@@ -5,4 +5,43 @@ language:
5
  base_model:
6
  - nasa-impact/nasa-smd-ibm-v0.1
7
  library_name: transformers
8
- ---
5
  base_model:
6
  - nasa-impact/nasa-smd-ibm-v0.1
7
  library_name: transformers
8
+ ---
9
+
10
+ # Science Keyword Classification model
11
+
12
+ We fine-tuned the [INDUS model](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) to classify science keywords from NASA's Common Metadata Repository (CMR). The project aims to improve the accessibility and organization of Earth observation metadata by predicting the keywords associated with each metadata record in an extreme multi-label classification setting.
13
+
14
+ ## Model Overview
15
+
16
+ - **Base Model:** INDUS, fine-tuned for multi-label classification.
17
+ - **Loss Function:** The model uses focal loss instead of traditional cross-entropy. Focal loss down-weights easy examples so that training focuses on difficult-to-classify ones, which helps with label imbalance.
18
+ - **Dataset:** NASA's CMR metadata, filtered to remove duplicates and irrelevant labels, resulting in a training dataset of 42,474 records and 3,240 labels.
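
To make the loss-function bullet concrete, here is a minimal sketch of focal loss in a multi-label setting. It assumes each label is scored independently with a probability (for example, after a sigmoid) and uses the basic form from the focal loss paper, FL(p_t) = -(1 - p_t)^γ · log(p_t), without the optional class-weighting term. The function name and arguments are illustrative and are not taken from the project's training code.

```python
import math

def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Mean focal loss over independent binary labels (illustrative sketch).

    probs   -- predicted probability that each label is present
    targets -- ground-truth 0/1 value for each label
    gamma   -- focusing parameter; gamma = 0 recovers binary cross-entropy
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigned to the true outcome.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
        # (1 - p_t)^gamma shrinks the loss of well-classified labels.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

# A confidently correct label contributes far less than a poorly predicted one.
easy = binary_focal_loss([0.9], [1], gamma=2.0)
hard = binary_focal_loss([0.3], [1], gamma=2.0)
print(f"easy={easy:.5f} hard={hard:.5f}")
```

With `gamma=0.0` the function reduces to ordinary binary cross-entropy, so the cross-entropy baseline and the focal-loss variants differ only in this one parameter.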
19
+
20
+
21
+
22
+ ## Key Features
23
+
24
+ - **Extreme Multi-Label Classification:** Handles a very large set of potential labels (keywords) whose frequencies are highly imbalanced.
25
+ - **Stratified Splitting:** The dataset is split based on `provider-id` to maintain balanced representation across train, validation, and test sets.
26
+ - **Improved Performance:** Focal loss was evaluated with different focusing parameters (γ) and improved weighted precision, recall, F1 score, and Jaccard similarity over both cross-entropy loss and previous models.
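
The stratified splitting described above can be sketched as follows. This is only one plausible reading: records are grouped by their `provider-id` value and each group is divided across the three sets in the same proportions. The 80/10/10 fractions, the dictionary record layout, and the function name are assumptions for illustration; the README does not state them.

```python
import random
from collections import defaultdict

def stratified_split(records, key="provider-id", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/val/test so each provider appears in similar proportions.

    The fractions are hypothetical defaults; whatever remains after the
    train and validation shares goes to the test set.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    train, val, test = [], [], []
    for provider in sorted(groups):  # sorted for reproducibility
        items = groups[provider]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_val = int(len(items) * fractions[1])
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test

# Toy data: two providers of different sizes.
records = [{"provider-id": "A", "id": i} for i in range(20)]
records += [{"provider-id": "B", "id": i} for i in range(10)]
train, val, test = stratified_split(records)
print(len(train), len(val), len(test))
```

Splitting within each provider, rather than over the pooled records, keeps small providers from ending up entirely in one set.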
27
+
28
+ ...
29
+ ## Experiments
30
+
31
+ 1. **Baseline (alpha-1.0.1):** Used cross-entropy loss.
32
+ 2. **Experiment 2 (alpha-1.1.1):** Focal loss with γ = 4.
33
+ 3. **Experiment 3 (alpha-1.1.2):** Focal loss with γ = 2.
34
+ 4. **Final (alpha-1.2.1):** Focal loss (γ = 2) with stratified splitting.
35
+
36
+ ## Results
37
+
38
+ The model with focal loss and stratified sampling (alpha-1.2.1) outperformed all other configurations and previous models in terms of precision, recall, F1 score, and Jaccard similarity. The weighted metrics for this model at various thresholds are shown below.
39
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f0e7de9cf89c9ed1bf92a2/CvhGPUzA2vJua3uu9H3tA.png)
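
For readers who want to reproduce this kind of threshold sweep, here is a rough sketch of how support-weighted metrics can be computed from per-label scores. It assumes "weighted" means each label's metric is averaged with weights proportional to how many true instances of the label exist, which is the usual convention but is not spelled out here. The function name and toy data are illustrative only.

```python
def weighted_metrics(scores, targets, threshold=0.5):
    """Support-weighted precision, recall, F1, and Jaccard for multi-label output.

    scores  -- list of rows, one score per label (e.g. sigmoid outputs)
    targets -- list of rows of 0/1 ground-truth labels, same shape as scores
    """
    n_labels = len(targets[0])
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "jaccard": 0.0}
    total_support = 0
    for j in range(n_labels):
        tp = fp = fn = 0
        for s_row, t_row in zip(scores, targets):
            pred = s_row[j] >= threshold
            if pred and t_row[j]:
                tp += 1
            elif pred:
                fp += 1
            elif t_row[j]:
                fn += 1
        support = tp + fn  # number of true instances of label j
        union = tp + fp + fn
        per_label = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / support if support else 0.0,
            "f1": 2 * tp / (2 * tp + fp + fn) if union else 0.0,
            "jaccard": tp / union if union else 0.0,
        }
        for name, value in per_label.items():
            totals[name] += support * value
        total_support += support
    if total_support == 0:
        return totals
    return {name: value / total_support for name, value in totals.items()}

# Toy scores for three records and two labels, evaluated at two thresholds.
scores = [[0.9, 0.2], [0.6, 0.7], [0.1, 0.8]]
targets = [[1, 0], [0, 1], [0, 1]]
for t in (0.5, 0.65):
    print(t, weighted_metrics(scores, targets, threshold=t))
```

Raising the threshold generally trades recall for precision, which is why reporting the metrics at several thresholds is more informative than a single operating point.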
40
+
41
+ See the accompanying [technical writeup](https://github.com/NASA-IMPACT/science-keywords-classification/blob/develop/documents/Science_Keyword_Classification.pdf).
42
+ ## References
43
+
44
+ - RoBERTa: [arXiv](https://arxiv.org/abs/1907.11692)
45
+ - Focal Loss: [arXiv](https://arxiv.org/abs/1708.02002)
46
+ - [NASA CMR](https://cmr.earthdata.nasa.gov/search)
47
+ - [Previous Model API](https://gcmd.nasa-impact.net/docs/)