---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
datasets:
- Ugiat/ner-cat
---

# NERCat Classifier

## Model Overview

The NERCat classifier is a fine-tuned version of Knowledgator's GLiNER model, designed specifically for Named Entity Recognition (NER) in Catalan. By leveraging a manually annotated dataset of Catalan-language television transcriptions, the classifier substantially improves the recognition of named entities across diverse categories, addressing the scarcity of high-quality training data for Catalan. The model was fine-tuned from the pre-trained checkpoint `knowledgator/gliner-bi-large-v1.0`.

## Quickstart

```py
import torch
from gliner import GLiNER

# Load the fine-tuned model, on GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

# "The University of Barcelona is one of the most important
# educational institutions in Catalonia."
text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories the model was fine-tuned on
labels = ["Person", "Facility", "Organization", "Location",
          "Product", "Event", "Date", "Law"]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

## Performance Evaluation

We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves recall and F1 in every category, and precision in every category except Date (an illustrative scoring sketch is included at the end of this card):

| Entity Type | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1 |
|----------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person | 1.00 | 1.00 | 1.00 | 0.92 | 0.80 | 0.86 | +0.08 | +0.20 | +0.14 |
| Facility | 0.89 | 1.00 | 0.94 | 0.67 | 0.25 | 0.36 | +0.22 | +0.75 | +0.58 |
| Organization | 1.00 | 1.00 | 1.00 | 0.72 | 0.62 | 0.67 | +0.28 | +0.38 | +0.33 |
| Location | 1.00 | 0.97 | 0.99 | 0.83 | 0.54 | 0.66 | +0.17 | +0.43 | +0.33 |
| Product | 0.96 | 1.00 | 0.98 | 0.63 | 0.21 | 0.31 | +0.34 | +0.79 | +0.67 |
| Event | 0.88 | 0.88 | 0.88 | 0.60 | 0.38 | 0.46 | +0.28 | +0.50 | +0.41 |
| Date | 0.88 | 1.00 | 0.93 | 1.00 | 0.07 | 0.13 | -0.13 | +0.93 | +0.80 |
| Law | 0.67 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | +0.67 | +1.00 | +0.80 |

## Fine-Tuning Process

The fine-tuning process followed a structured approach covering dataset preparation, model training, and optimization (an illustrative configuration sketch follows the list):

- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
  - Batch size: 8
  - Steps: 500
  - Loss function: focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: linear with a warmup ratio of 0.1
  - Evaluation frequency: every 100 steps
  - Checkpointing: every 1000 steps
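For illustration, here is a minimal sketch of how these hyperparameters could be wired together with standard PyTorch and `transformers` utilities. This is not the authors' training script: the focal-loss helper and the substring filter used to pick out the entity layers (`span_rep`/`prompt_rep`) are assumptions about GLiNER's module names, and the data loading and training loop are omitted.

```py
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-bi-large-v1.0")

# Two parameter groups so the entity layers train with a lower learning
# rate than the rest of the model (the name filter is an assumption).
entity_params, other_params = [], []
for name, param in model.named_parameters():
    (entity_params if "span_rep" in name or "prompt_rep" in name
     else other_params).append(param)

optimizer = AdamW([
    {"params": entity_params, "lr": 5e-6},  # entity layers
    {"params": other_params, "lr": 1e-5},   # other model parameters
])

# Linear schedule over 500 steps with a 0.1 warmup ratio.
num_steps = 500
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_steps),
    num_training_steps=num_steps,
)

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Binary focal loss (Lin et al., 2017) with the card's settings."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

A full run would then shuffle the annotated data into the 90/10 split, draw batches of 8, backpropagate `focal_loss` on the model's span scores, step the optimizer and scheduler once per batch, evaluate every 100 steps, and write a checkpoint every 1000 steps.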
The dataset included 13,732 named entity instances across the eight categories listed above (Person, Facility, Organization, Location, Product, Event, Date, and Law).

## Other

### Citation Information

```bibtex
@misc{cadevall2025nercat,
      title = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
      author = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
      year = {2025},
      eprint = {2503.14173},
      archivePrefix = {arXiv},
      url = {https://arxiv.org/abs/2503.14173}
}
```
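### Evaluation Scoring Sketch

The table in the Performance Evaluation section reports per-category precision, recall, and F1. The snippet below is one minimal way to compute such scores; it assumes gold and predicted entities are compared as exact `(start, end, label)` spans, which is an assumption rather than the documented matching criterion.

```py
from collections import defaultdict

def span_scores(gold, pred):
    """Per-label precision/recall/F1 from exact (start, end, label) matches.

    `gold` and `pred` are parallel lists with one set of
    (start, end, label) tuples per sentence.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            (tp if span in gold_spans else fp)[span[2]] += 1
        for span in gold_spans - pred_spans:  # missed gold spans
            fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores

# Tiny usage example with a single sentence (offsets are hypothetical):
gold = [{(3, 28, "Organization")}]
pred = [{(3, 28, "Organization"), (75, 84, "Location")}]
print(span_scores(gold, pred))
```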