---
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---
# DistilBERT Food Product Classification Model - Random Word Swapping Augmentation
## Model Details
### Model Description
This model is distilbert-base-uncased fine-tuned for multi-class food product text classification, with random word swapping used as data augmentation.
- **Developed by:** [DataScienceWFSR](https://huggingface.co/DataScienceWFSR)
- **Model type:** Text Classification
- **Language(s) (NLP):** English
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
### Model Sources
- **Repository:** [https://github.com/WFSRDataScience/SemEval2025Task9](https://github.com/WFSRDataScience/SemEval2025Task9)
- **Paper:** [https://arxiv.org/abs/2504.20703](https://arxiv.org/abs/2504.20703)
## How to Get Started With the Model
Use the code below to get started with the model in PyTorch.
```python
import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Build the repository ID for this checkpoint.
model_name, category, augmentation = 'distilbert', 'product', 'rw'
repo_id = f"DataScienceWFSR/{model_name}-food-{category}-{augmentation}"

# Download the label encoder that maps class indices back to label names.
lb_path = hf_hub_download(repo_id=repo_id, filename=f"labelencoder_{category}.pkl")
lb = pd.read_pickle(lb_path)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()  # disable dropout for inference

sample = ('Case Number: 039-94   Date Opened: 10/20/1994   Date Closed: 03/06/1995 Recall Class: 1'
          ' Press Release (Y/N): N Domestic Est. Number: 07188 M   Name: PREPARED FOODS Imported '
          'Product (Y/N): N Foreign Estab. Number: N/A City: SANTA TERESA State: NM Country: USA'
          ' Product: HAM, SLICED Problem: BACTERIA Description: LISTERIA '
          'Total Pounds Recalled: 3,920 Pounds Recovered: 3,920')

# truncation=True keeps long reports within the model's maximum input length.
inputs = tokenizer(sample, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)

# Map the predicted class index back to its label name.
predicted_label = lb.inverse_transform(predictions.numpy())[0]
print(f"The predicted label is: {predicted_label}")
```
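The snippet above reports only the single most likely label. As a minimal sketch, the logits could also be converted into ranked class probabilities with a softmax. The helper `top_k_probabilities` and the label names used here are hypothetical and are not part of the released code.

```python
import math


def top_k_probabilities(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs for one sample."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort labels by probability, highest first.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Illustrative logits and label names (made up for this example).
example_logits = [2.0, 0.5, -1.0]
example_labels = ["label_a", "label_b", "label_c"]
for label, prob in top_k_probabilities(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the inputs would be something like `outputs.logits[0].tolist()` and `list(lb.classes_)`, assuming the pickled encoder is a scikit-learn `LabelEncoder`.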
## Training Details
### Training Data
Training and validation data were provided by the SemEval-2025 Task 9 organizers: the `Food Recall Incidents` dataset (English only) ([link](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/tree/main/data)).
### Training Procedure
#### Training Hyperparameters
- batch_size: `32`
- epochs: `5`
- lr_scheduler: `linear`
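To illustrate what a `linear` scheduler does over 5 epochs with a batch size of 32, here is a rough sketch of a linearly decaying learning rate. The training-set size, base learning rate, and absence of warmup are assumptions chosen only for illustration; this card does not state them.

```python
import math


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from this card.
batch_size = 32
epochs = 5

# Assumptions for illustration only (not stated in this card).
num_train_samples = 5000
base_lr = 5e-5

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Print the learning rate at the start of each epoch and at the final step.
for epoch in range(epochs + 1):
    step = epoch * steps_per_epoch
    print(f"step {step:4d}: lr = {linear_lr(step, total_steps, base_lr):.2e}")
```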
## Evaluation
### Testing Data & Metrics
#### Testing Data
Test data: 997 samples ([link](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/blob/main/data/incidents_test.csv))
#### Metrics
F<sub>1</sub>-macro
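For reference, macro-averaged F<sub>1</sub> computes one F<sub>1</sub> score per class and takes their unweighted mean, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that definition; the function name and toy labels are illustrative, and the official evaluation script may differ in details.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[true] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels.
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "c"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```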
### Results
F<sub>1</sub>-macro scores of each model on the official test set, using the `text` field, reported per category together with the subtask scores (ST1 and ST2) and rounded to 3 decimals. This model's results are shown in bold.
| Model | hazard-category | product-category | hazard | product | ST1 | ST2 |
|----------------------|----------------:|-----------------:|-------:|--------:|------:|------:|
| BERT<sub>base</sub> | 0.747 | 0.757 | 0.581 | 0.170 | 0.753 | 0.382 |
| BERT<sub>CW</sub> | 0.760 | 0.761 | 0.671 | 0.280 | 0.762 | 0.491 |
| BERT<sub>SR</sub> | 0.770 | 0.754 | 0.666 | 0.275 | 0.764 | 0.478 |
| BERT<sub>RW</sub> | 0.752 | 0.757 | 0.651 | 0.275 | 0.756 | 0.467 |
| DistilBERT<sub>base</sub> | 0.761 | 0.757 | 0.593 | 0.154 | 0.760 | 0.378 |
| DistilBERT<sub>CW</sub> | 0.766 | 0.753 | 0.635 | 0.246 | 0.763 | 0.449 |
| DistilBERT<sub>SR</sub> | 0.756 | 0.759 | 0.644 | 0.240 | 0.763 | 0.448 |
| **DistilBERT<sub>RW</sub>** | **0.749** | **0.747** | **0.647** | **0.261** | **0.753** | **0.462** |
| RoBERTa<sub>base</sub> | 0.760 | 0.753 | 0.579 | 0.123 | 0.755 | 0.356 |
| RoBERTa<sub>CW</sub> | 0.773 | 0.739 | 0.630 | 0.000 | 0.760 | 0.315 |
| RoBERTa<sub>SR</sub> | 0.777 | 0.755 | 0.637 | 0.000 | 0.767 | 0.319 |
| RoBERTa<sub>RW</sub> | 0.757 | 0.611 | 0.615 | 0.000 | 0.686 | 0.308 |
| ModernBERT<sub>base</sub> | 0.781 | 0.745 | 0.667 | 0.275 | 0.769 | 0.485 |
| ModernBERT<sub>CW</sub> | 0.761 | 0.712 | 0.609 | 0.252 | 0.741 | 0.441 |
| ModernBERT<sub>SR</sub> | 0.790 | 0.728 | 0.591 | 0.253 | 0.761 | 0.434 |
| ModernBERT<sub>RW</sub> | 0.761 | 0.751 | 0.629 | 0.237 | 0.759 | 0.440 |
## Technical Specifications
### Compute Infrastructure
#### Hardware
NVIDIA A100 80GB and NVIDIA GeForce RTX 3070 Ti
#### Software
| Library | Version | URL |
|-------------------|--------:|---------------------------------------------------------------------|
| Transformers | 4.49.0 | https://huggingface.co/docs/transformers/index |
| PyTorch | 2.6.0 | https://pytorch.org/ |
| SpaCy | 3.8.4 | https://spacy.io/ |
| Scikit-learn | 1.6.0 | https://scikit-learn.org/stable/ |
| Pandas | 2.2.3 | https://pandas.pydata.org/ |
| Optuna | 4.2.1 | https://optuna.org/ |
| NumPy | 2.0.2 | https://numpy.org/ |
| NLP AUG | 1.1.11 | https://nlpaug.readthedocs.io/en/latest/index.html |
| BeautifulSoup4 | 4.12.3 | https://www.crummy.com/software/BeautifulSoup/bs4/doc/# |
## Citation
**BibTeX:**
For the original paper:
```
@inproceedings{brightcookies-semeval2025-task9,
    title = "BrightCookies at {S}em{E}val-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification",
    author = "Papadopoulou, Foteini and Mutlu, Osman and Özen, Neris and van der Velden, Bas H. M. and Hendrickx, Iris and Hürriyetoğlu, Ali",
booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
}
```
For SemEval-2025 Task 9:
```
@inproceedings{semeval2025-task9,
title = "{S}em{E}val-2025 Task 9: The Food Hazard Detection Challenge",
author = "Randl, Korbinian and Pavlopoulos, John and Henriksson, Aron and Lindgren, Tony and Bakagianni, Juli",
booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
}
```
## Model Card Authors and Contact
Authors: Foteini Papadopoulou, Osman Mutlu, Neris Özen,
Bas H.M. van der Velden, Iris Hendrickx, Ali Hürriyetoğlu
Contact: [email protected]