---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
---
# NERCat Classifier

## Model Overview

The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in Catalan. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan.

The pre-trained model used as the starting point for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.

## Quickstart

```py
import torch
from gliner import GLiNER

# Use a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("ugiat/NERCat", load_tokenizer=True).to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories the model was fine-tuned on
labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

# Predictions scoring below the confidence threshold are discarded
entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

## Performance Evaluation

We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. The results show significant improvements across all named entity categories:

| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00             | 1.00          | 1.00      | 0.92             | 0.80          | 0.86      | +0.08       | +0.20    | +0.14 |
| Facility     | 0.89             | 1.00          | 0.94      | 0.67             | 0.25          | 0.36      | +0.22       | +0.75    | +0.58 |
| Organization | 1.00             | 1.00          | 1.00      | 0.72             | 0.62          | 0.67      | +0.28       | +0.38    | +0.33 |
| Location     | 1.00             | 0.97          | 0.99      | 0.83             | 0.54          | 0.66      | +0.17       | +0.43    | +0.33 |
| Product      | 0.96             | 1.00          | 0.98      | 0.63             | 0.21          | 0.31      | +0.34       | +0.79    | +0.67 |
| Event        | 0.88             | 0.88          | 0.88      | 0.60             | 0.38          | 0.46      | +0.28       | +0.50    | +0.41 |
| Date         | 0.88             | 1.00          | 0.93      | 1.00             | 0.07          | 0.13      | -0.13       | +0.93    | +0.80 |
| Law          | 0.67             | 1.00          | 0.80      | 0.00             | 0.00          | 0.00      | +0.67       | +1.00    | +0.80 |

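As a sanity check on the table, each F1 column is the harmonic mean of the corresponding precision and recall columns; a minimal sketch (the function name is illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Facility row: NERCat precision 0.89, recall 1.00 -> F1 0.94
print(round(f1(0.89, 1.00), 2))  # 0.94
# Baseline GLiNER found no Law entities, so its F1 collapses to 0.0
print(f1(0.00, 0.00))  # 0.0
```

Note how the harmonic mean punishes imbalance: the baseline's perfect Date precision (1.00) cannot compensate for its 0.07 recall, yielding an F1 of only 0.13.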

## Fine-Tuning Process

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:

- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
  - Batch size: 8
  - Steps: 500
  - Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalances
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: Linear with a warmup ratio of 0.1
  - Evaluation frequency: Every 100 steps
  - Checkpointing: Every 1000 steps
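The focal loss named above down-weights confidently classified examples so that training focuses on hard, rare entity types. A minimal scalar sketch of the standard formulation, with the α and γ values from the setup (function name and example probabilities are illustrative, not from the training code):

```python
import math

def focal_loss(p_t, alpha=0.75, gamma=2.0):
    """Focal loss for one prediction, where p_t is the predicted
    probability of the true class: -alpha * (1 - p_t)**gamma * log(p_t).
    With gamma = 2, easy examples (p_t near 1) contribute almost nothing."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A hard example (p_t = 0.1) dominates an easy one (p_t = 0.9)
print(focal_loss(0.1) > 100 * focal_loss(0.9))  # True
```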

The dataset used for fine-tuning included 13,732 named entity instances across the eight categories listed above.

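The shuffle-and-split step described above can be sketched as follows (the helper name and seed are illustrative; the actual split was performed over the annotated training data):

```python
import random

def train_test_split(examples, train_frac=0.9, seed=42):
    """Shuffle a copy of the data and split it into train/test subsets,
    mirroring the 90/10 split used for fine-tuning."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(1000))
print(len(train), len(test))  # 900 100
```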
## Other

### Citation Information

```
@misc{article_id,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Marc Bardeli Gámez and Marc Serrano Sanz and Guillem Cadevall Ferreres and Pol Gerdt Basullas and Raul Quijada Ferrero and Francesc Tarres Ruiz},
  year          = {2025},
  archivePrefix = {arXiv},
  url           = {URL_of_the_paper} (PENDING)
}
```