---
language:
- en
metrics:
- f1
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---


# DistilBERT Food Product Classification Model - Random Word Swapping Augmentation

## Model Details

### Model Description

This model is a fine-tuned version of distilbert-base-uncased for multi-class food product text classification. It was trained with random word swapping data augmentation.

- **Developed by:** [DataScienceWFSR](https://huggingface.co/DataScienceWFSR)
- **Model type:** Text Classification
- **Language(s) (NLP):** English
- **Finetuned from model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
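
As a rough illustration of the random word swapping augmentation named in the description, here is a minimal pure-Python sketch. It is not the authors' implementation: the function name `random_word_swap`, the `n_swaps` and `seed` parameters, and the example sentence are all hypothetical, and the exact augmenter and settings used for training are not stated in this card.

```python
import random


def random_word_swap(text, n_swaps=1, seed=None):
    """Return a copy of `text` with `n_swaps` randomly chosen word pairs exchanged.

    Hypothetical helper for illustration only; not the authors' training code.
    """
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        # Nothing to swap in an empty or single-word text.
        return text
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange the words found there.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


# Hypothetical example sentence.
original = "sliced ham recalled due to listeria contamination"
augmented = random_word_swap(original, n_swaps=2, seed=0)
print(augmented)
```

The augmented text keeps the same words as the original, only in a different order, so the label of the training sample is left unchanged.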

### Model Sources

- **Repository:** [https://github.com/WFSRDataScience/SemEval2025Task9](https://github.com/WFSRDataScience/SemEval2025Task9)
- **Paper:** [https://arxiv.org/abs/2504.20703](https://arxiv.org/abs/2504.20703)


## How to Get Started With the Model 

Use the code below to get started with the model in PyTorch.

```python
import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Components of the repository name on the Hugging Face Hub.
model_name, category, augmentation = 'distilbert', 'product', 'rw'
repo_id = f"DataScienceWFSR/{model_name}-food-{category}-{augmentation}"

# Download the label encoder that maps class indices back to label names.
lb_path = hf_hub_download(repo_id=repo_id, filename=f"labelencoder_{category}.pkl")
lb = pd.read_pickle(lb_path)

# Load the tokenizer and the fine-tuned model, then switch to inference mode.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

sample = ('Case Number: 039-94 Date Opened: 10/20/1994 Date Closed: 03/06/1995 Recall Class: 1'
        ' Press Release (Y/N): N Domestic Est. Number: 07188 M Name: PREPARED FOODS Imported '
        'Product (Y/N): N Foreign Estab. Number: N/A City: SANTA TERESA State: NM Country: USA'
        ' Product: HAM, SLICED Problem: BACTERIA Description: LISTERIA '
        'Total Pounds Recalled: 3,920 Pounds Recovered: 3,920')

# Tokenize the text; truncate inputs longer than the model's maximum sequence length.
inputs = tokenizer(sample, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring class index and map it back to its label name.
predictions = outputs.logits.argmax(dim=-1)
predicted_label = lb.inverse_transform(predictions.numpy())[0]
print(f"The predicted label is: {predicted_label}")
```


## Training Details

### Training Data

Training and validation data were provided by the SemEval-2025 Task 9 organizers: the `Food Recall Incidents` dataset (English only) ([link](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/tree/main/data)).

### Training Procedure

#### Training Hyperparameters

- batch_size: `32`
- epochs: `5`
- lr_scheduler: `linear`
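
To make the hyperparameters above concrete, the sketch below computes how many optimizer steps they imply and how a linear schedule would change the learning rate over those steps. It assumes `linear` means the common schedule that decays linearly to zero, optionally after a warmup phase. The function `linear_lr`, the training set size, the base learning rate, and the warmup setting are hypothetical values chosen for illustration; they are not reported in this card.

```python
import math


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Hypothetical helper for illustration; not the authors' training code.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch_size = 32            # from the hyperparameters above
epochs = 5                 # from the hyperparameters above
num_train_samples = 5000   # hypothetical: training set size is an assumption
base_lr = 5e-5             # hypothetical: learning rate is an assumption

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Print the learning rate at the start of each epoch and at the final step.
for epoch in range(epochs + 1):
    step = epoch * steps_per_epoch
    print(f"step {step:4d}: lr = {linear_lr(step, total_steps, base_lr):.2e}")
```

With no warmup, the learning rate starts at its base value and reaches zero at the last step.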


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data & Metrics

#### Testing Data


Test data: 997 samples ([link](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/blob/main/data/incidents_test.csv))

#### Metrics

F<sub>1</sub>-macro
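
For reference, the macro-averaged F<sub>1</sub> score is the unweighted mean of the per-class F<sub>1</sub> scores, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that definition, not the official evaluation script; the function name `macro_f1` and the example labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration.
y_true = ["meat", "meat", "dairy", "seafood"]
y_pred = ["meat", "dairy", "dairy", "seafood"]
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this example the per-class scores are 2/3 for `meat`, 2/3 for `dairy`, and 1 for `seafood`, so the macro average is 7/9.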

### Results
F<sub>1</sub>-macro scores for each model on the official test set, using the `text` field as input. Scores are reported per category, together with the subtask scores (ST1 and ST2), and are rounded to three decimals. This model's results are shown in bold.

| Model                | hazard-category | product-category | hazard | product |  ST1  |  ST2  |
|----------------------|----------------:|-----------------:|-------:|--------:|------:|------:|
| BERT<sub>base</sub>         | 0.747 | 0.757 | 0.581 | 0.170 | 0.753 | 0.382 |
| BERT<sub>CW</sub>       | 0.760 | 0.761 | 0.671 | 0.280 | 0.762 | 0.491 |
| BERT<sub>SR</sub>           | 0.770 | 0.754 | 0.666 | 0.275 | 0.764 | 0.478 |
| BERT<sub>RW</sub>           | 0.752 | 0.757 | 0.651 | 0.275 | 0.756 | 0.467 |
| DistilBERT<sub>base</sub>   | 0.761 | 0.757 | 0.593 | 0.154 | 0.760 | 0.378 |
| DistilBERT<sub>CW</sub>     | 0.766 | 0.753 | 0.635 | 0.246 | 0.763 | 0.449 |
| DistilBERT<sub>SR</sub>     | 0.756 | 0.759 | 0.644 | 0.240 | 0.763 | 0.448 |
| **DistilBERT<sub>RW</sub>**     | **0.749** | **0.747** | **0.647** | **0.261** | **0.753** | **0.462** |
| RoBERTa<sub>base</sub>      | 0.760 | 0.753 | 0.579 | 0.123 | 0.755 | 0.356 |
| RoBERTa<sub>CW</sub>        | 0.773 | 0.739 | 0.630 | 0.000 | 0.760 | 0.315 |
| RoBERTa<sub>SR</sub>        | 0.777 | 0.755 | 0.637 | 0.000 | 0.767 | 0.319 |
| RoBERTa<sub>RW</sub>        | 0.757 | 0.611 | 0.615 | 0.000 | 0.686 | 0.308 |
| ModernBERT<sub>base</sub>   | 0.781 | 0.745 | 0.667 | 0.275 | 0.769 | 0.485 |
| ModernBERT<sub>CW</sub>     | 0.761 | 0.712 | 0.609 | 0.252 | 0.741 | 0.441 |
| ModernBERT<sub>SR</sub>     | 0.790 | 0.728 | 0.591 | 0.253 | 0.761 | 0.434 |
| ModernBERT<sub>RW</sub>     | 0.761 | 0.751 | 0.629 | 0.237 | 0.759 | 0.440 |


## Technical Specifications 

### Compute Infrastructure

#### Hardware

NVIDIA A100 80GB and NVIDIA GeForce RTX 3070 Ti 

#### Software

| Library           | Version | URL                                                                 |
|-------------------|--------:|---------------------------------------------------------------------|
| Transformers      |   4.49.0 | https://huggingface.co/docs/transformers/index                      |
| PyTorch           |   2.6.0  | https://pytorch.org/                                                |
| spaCy             |   3.8.4  | https://spacy.io/                                                   |
| Scikit-learn      |   1.6.0  | https://scikit-learn.org/stable/                                    |
| Pandas            |   2.2.3  | https://pandas.pydata.org/                                          |
| Optuna            |   4.2.1  | https://optuna.org/                                                 |
| NumPy             |   2.0.2  | https://numpy.org/                                                  |
| nlpaug            |  1.1.11  | https://nlpaug.readthedocs.io/en/latest/index.html                  |
| BeautifulSoup4    |  4.12.3  | https://www.crummy.com/software/BeautifulSoup/bs4/doc/#             |


## Citation

**BibTeX:**

For the original paper:
```
@inproceedings{brightcookies-semeval2025-task9, 
    title = "BrightCookies at {S}em{E}val-2025 Task 9: Exploring Data Augmentation for Food Hazard Classification",
    author = "Papadopoulou, Foteini and Mutlu, Osman and Özen, Neris and van der Velden, Bas H. M. and Hendrickx, Iris and Hürriyetoğlu, Ali",
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    month = jul, 
    year = "2025", 
    address = "Vienna, Austria", 
    publisher = "Association for Computational Linguistics", 
} 
```

For SemEval-2025 Task 9:
```
@inproceedings{semeval2025-task9, 
    title = "{S}em{E}val-2025 Task 9: The Food Hazard Detection Challenge", 
    author = "Randl, Korbinian and Pavlopoulos, John and Henriksson, Aron and Lindgren, Tony and Bakagianni, Juli", 
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    month = jul, 
    year = "2025", 
    address = "Vienna, Austria", 
    publisher = "Association for Computational Linguistics", 
} 
```

## Model Card Authors and Contact

Authors: Foteini Papadopoulou, Osman Mutlu, Neris Özen,
Bas H.M. van der Velden, Iris Hendrickx, Ali Hürriyetoğlu

Contact: [email protected]