---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
datasets:
- Ugiat/ner-cat
---
# NERCat Classifier
## Model Overview
The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in the Catalan language. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.
The pre-trained version used for fine-tuning was: `knowledgator/gliner-bi-large-v1.0`.
## Quickstart
```py
import torch
from gliner import GLiNER

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
## Performance Evaluation
We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. The results show substantial improvements across nearly all named entity categories:
| Entity Type | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1 |
|----------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person | 1.00 | 1.00 | 1.00 | 0.92 | 0.80 | 0.86 | +0.08 | +0.20 | +0.14 |
| Facility | 0.89 | 1.00 | 0.94 | 0.67 | 0.25 | 0.36 | +0.22 | +0.75 | +0.58 |
| Organization | 1.00 | 1.00 | 1.00 | 0.72 | 0.62 | 0.67 | +0.28 | +0.38 | +0.33 |
| Location | 1.00 | 0.97 | 0.99 | 0.83 | 0.54 | 0.66 | +0.17 | +0.43 | +0.33 |
| Product | 0.96 | 1.00 | 0.98 | 0.63 | 0.21 | 0.31 | +0.34 | +0.79 | +0.67 |
| Event | 0.88 | 0.88 | 0.88 | 0.60 | 0.38 | 0.46 | +0.28 | +0.50 | +0.41 |
| Date | 0.88 | 1.00 | 0.93 | 1.00 | 0.07 | 0.13 | -0.13 | +0.93 | +0.80 |
| Law | 0.67 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | +0.67 | +1.00 | +0.80 |
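The scores above follow the standard entity-level definitions of precision, recall, and F1. A minimal sketch (the counts below are illustrative, not the actual evaluation tallies):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from entity-level counts.

    tp: predicted entities matching a gold entity
    fp: predicted entities with no gold match
    fn: gold entities the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Illustrative counts only:
p, r, f1 = precision_recall_f1(tp=8, fp=0, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 1.0 0.8 0.89
```

Note that a perfect-precision, low-recall baseline (as GLiNER shows on Date) still yields a very low F1, which is why the Δ F1 column is the most informative summary.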
## Fine-Tuning Process
The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:
- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
- Batch size: 8
- Steps: 500
- Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalances
- Learning rates:
- Entity layers: $5 \times 10^{-6}$
- Other model parameters: $1 \times 10^{-5}$
- Scheduler: Linear with a warmup ratio of 0.1
- Evaluation frequency: Every 100 steps
- Checkpointing: Every 1000 steps
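The focal-loss and scheduler choices above can be sketched in plain Python. This is an illustrative binary formulation with hypothetical helper names, not the code used for training (the actual run uses GLiNER's training utilities):

```python
import math

def binary_focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor down-weights easy, well-classified
    examples so training focuses on hard ones -- the motivation for
    using it against class imbalance.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def linear_warmup_lr(step, base_lr, total_steps=500, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)  # 50 steps here
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# An easy positive (p = 0.9) contributes far less than plain
# cross-entropy (-log 0.9 ≈ 0.105):
print(binary_focal_loss(0.9, 1))
print(linear_warmup_lr(50, 1e-5))  # peak base learning rate
```

The two learning rates in the setup correspond to two optimizer parameter groups (one for the entity layers at $5 \times 10^{-6}$, one for the remaining parameters at $1 \times 10^{-5}$), each following this warmup/decay schedule.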
The training dataset comprised 13,732 named entity instances across the eight categories evaluated above: Person, Facility, Organization, Location, Product, Event, Date, and Law.
## Other
### Citation Information
```
@misc{article_id,
  title = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year = {2025},
  eprint = {2503.14173},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2503.14173}
}
``` |