+---
+language:
+- ca
+- es
+multilinguality:
+- multilingual
+pretty_name: NERCat
+tags:
+- NER
+- Catalan
+- NLP
+- television transcriptions
+- manual annotation
+- GLiNER
+task_categories:
+- text-classification
+- token-classification
+task_ids:
+- multi-label-classification
+- named-entity-recognition
+license: apache-2.0
+---
+# NERCat Classifier
+## Model Overview
+NERCat is a fine-tuned version of Knowledgator's GLiNER model for Named Entity Recognition (NER) in Catalan. It was trained on a manually annotated dataset of Catalan-language television transcriptions and improves the recognition of named entities across eight categories, addressing the scarcity of high-quality training data for Catalan.
+The pre-trained checkpoint used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.
+## Quickstart
+```py
+import torch
+from gliner import GLiNER
+
+# Run on the GPU when one is available, otherwise fall back to the CPU.
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = GLiNER.from_pretrained("ugiat/NERCat", load_tokenizer=True).to(device)
+
+text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."
+
+# The eight entity types covered by NERCat.
+labels = [
+    "Person",
+    "Facility",
+    "Organization",
+    "Location",
+    "Product",
+    "Event",
+    "Date",
+    "Law"
+]
+
+# Keep only predictions whose confidence score exceeds the threshold.
+entities = model.predict_entities(text, labels, threshold=0.5)
+for entity in entities:
+    print(entity["text"], "=>", entity["label"])
+```
+## Performance Evaluation
+We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually classified evaluation dataset of 100 sentences. NERCat improves recall and F1 for every entity type, and precision for every type except Date, where precision drops from 1.00 to 0.88 while recall rises from 0.07 to 1.00:
+| Entity Type    | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
+|----------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
+| Person         | 1.00              | 1.00          | 1.00      | 0.92              | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
+| Facility       | 0.89              | 1.00          | 0.94      | 0.67              | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
+| Organization   | 1.00              | 1.00          | 1.00      | 0.72              | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
+| Location       | 1.00              | 0.97          | 0.99      | 0.83              | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
+| Product        | 0.96              | 1.00          | 0.98      | 0.63              | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
+| Event          | 0.88              | 0.88          | 0.88      | 0.60              | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
+| Date           | 0.88              | 1.00          | 0.93      | 1.00              | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
+| Law            | 0.67              | 1.00          | 0.80      | 0.00              | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |
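The model card does not say how the F1 column was computed. The sketch below assumes the standard definition (harmonic mean of precision and recall, taken as 0 when both are 0) and recomputes F1 and Δ F1 for three rows copied from the table. The helper name `f1_score` is illustrative and not part of the NERCat code. Because the inputs are already rounded to two decimals, recomputed values for other rows can differ from the table by about 0.01.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation table.
nercat = {"Person": (1.00, 1.00), "Facility": (0.89, 1.00), "Law": (0.67, 1.00)}
baseline = {"Person": (0.92, 0.80), "Facility": (0.67, 0.25), "Law": (0.00, 0.00)}

for label in nercat:
    f1_new = f1_score(*nercat[label])
    f1_old = f1_score(*baseline[label])
    print(f"{label}: NERCat F1={f1_new:.2f}, GLiNER F1={f1_old:.2f}, "
          f"delta={f1_new - f1_old:+.2f}")
```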
+## Fine-Tuning Process
+Fine-tuning covered dataset preparation, model training, and optimization:
+- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
+- **Training Setup:**
+  - Batch size: 8
+  - Steps: 500
+  - Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalance
+  - Learning rates:
+    - Entity layers: $5 \times 10^{-6}$
+    - Other model parameters: $1 \times 10^{-5}$
+  - Scheduler: Linear with a warmup ratio of 0.1
+  - Evaluation frequency: Every 100 steps
+  - Checkpointing: Every 1000 steps
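To make two of the training settings above concrete, here is a minimal sketch in plain Python. It assumes the common binary form of focal loss, $-\alpha_t (1 - p_t)^{\gamma} \log p_t$, and a linear schedule that warms up from 0 and then decays linearly to 0, as in widely used linear-with-warmup schedulers. The function names are illustrative. How GLiNER applies and reduces the loss internally, and the exact scheduler implementation used for NERCat, are not described in this card.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Binary focal loss for a single prediction.

    p is the predicted probability of the positive class, y is the 0/1 target.
    """
    p_t = p if y == 1 else 1.0 - p               # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)              # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def linear_warmup_lr(step: int, base_lr: float,
                     total_steps: int = 500, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Easy, correctly classified examples contribute far less than hard ones.
print(focal_loss(0.9, 1), focal_loss(0.1, 1))

# Learning rate for the non-entity parameters at a few steps of the 500-step run.
for step in (0, 25, 50, 275, 500):
    print(step, linear_warmup_lr(step, base_lr=1e-5))
```

With a warmup ratio of 0.1 over 500 steps, the first 50 steps are warmup. The $(1 - p_t)^{\gamma}$ factor is what down-weights well-classified examples, which is why focal loss is a common choice when negative spans far outnumber entity spans.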
+
+The dataset included 13,732 named entity instances across eight categories: Person, Facility, Organization, Location, Product, Event, Date, and Law.
+## Other
+### Citation Information
+```
+@misc{article_id,
+  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
+  author        = {Marc Bardeli Gámez and Marc Serrano Sanz and Guillem Cadevall Ferreres and Pol Gerdt Basullas and Raul Quijada Ferrero and Francesc Tarres Ruiz},
+  year          = {2025},
+  archivePrefix = {arXiv},
+  url           = {URL_of_the_paper},
+  note          = {URL pending}
+}
+```