---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
datasets:
- Ugiat/ner-cat
---
# NERCat Classifier

## Model Overview

The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in the Catalan language. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.

The base model used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.

## Quickstart
```py
import torch
from gliner import GLiNER

# Load the fine-tuned model (requires the `gliner` package: pip install gliner)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law"
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```


## Performance Evaluation

We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat achieves a higher F1 score for every entity category; the only regression is a slightly lower precision for Date:

| Entity Type    | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|----------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person         | 1.00              | 1.00          | 1.00      | 0.92              | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility       | 0.89              | 1.00          | 0.94      | 0.67              | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization   | 1.00              | 1.00          | 1.00      | 0.72              | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location       | 1.00              | 0.97          | 0.99      | 0.83              | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product        | 0.96              | 1.00          | 0.98      | 0.63              | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event          | 0.88              | 0.88          | 0.88      | 0.60              | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date           | 0.88              | 1.00          | 0.93      | 1.00              | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
| Law            | 0.67              | 1.00          | 0.80      | 0.00              | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
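Each F1 value in the table is the harmonic mean of precision and recall, $F1 = 2PR/(P + R)$. As a quick sanity check, the Facility row can be reproduced from its precision and recall columns:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Facility row from the table above
nercat_f1 = round(f1(0.89, 1.00), 2)   # 0.94
gliner_f1 = round(f1(0.67, 0.25), 2)   # 0.36
delta = round(nercat_f1 - gliner_f1, 2)  # +0.58
```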


## Fine-Tuning Process

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:

- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
  - Batch size: 8
  - Steps: 500
  - Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalances
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: Linear with a warmup ratio of 0.1
  - Evaluation frequency: Every 100 steps
  - Checkpointing: Every 1000 steps
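The focal-loss term above can be illustrated for a single prediction. This is a generic scalar formulation using the listed α = 0.75 and γ = 2, not NERCat's actual training code:

```python
import math

def focal_loss(p_correct: float, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Focal loss for one prediction: -alpha * (1 - p)^gamma * log(p).

    The (1 - p)^gamma factor down-weights confident (easy) predictions,
    so training gradients concentrate on hard or rare classes.
    """
    return -alpha * (1.0 - p_correct) ** gamma * math.log(p_correct)

# A confident prediction contributes far less loss than an uncertain one
easy = focal_loss(0.9)
hard = focal_loss(0.5)
```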

The dataset included 13,732 named entity instances across eight categories: Person, Facility, Organization, Location, Product, Event, Date, and Law.
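The shuffle-and-split step described above can be sketched as follows; the seed and the placeholder sentences are illustrative assumptions, not values from the paper:

```python
import random

def train_test_split(examples, train_frac=0.9, seed=42):
    """Shuffle the examples and split them into train/test subsets (90/10 by default)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

sentences = [f"sentence {i}" for i in range(100)]  # placeholder data
train, test = train_test_split(sentences)
```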

## Other

### Citation Information

```
@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  eprint        = {2503.14173},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2503.14173}
}
```