# Personal Noun Detector for German
This model is a fine-tuned version of google-bert/bert-base-german-cased for detecting personal nouns in German (see "Training and evaluation data" below for the dataset).
During inference, we apply the aggregation strategy "simple" from the Hugging Face pipeline API (aggregation_strategy="simple"). This ensures that contiguous tokens sharing the same entity label are grouped and returned as a single entity, improving usability in downstream applications.
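To illustrate what the "simple" aggregation strategy does, here is a minimal sketch of the grouping logic in plain Python (the token/label data and the `PN`/`O` label names are invented for illustration; the actual implementation lives inside the Hugging Face pipeline):

```python
# Sketch of aggregation_strategy="simple": contiguous tokens with the same
# entity label are merged into one entity; "##" marks WordPiece continuations.
def aggregate_simple(token_preds):
    groups = []
    for tok, label in token_preds:
        cont = tok.startswith("##")
        word = tok[2:] if cont else tok
        if groups and groups[-1]["entity_group"] == label:
            # Same label as previous token: extend the current entity.
            groups[-1]["word"] += word if cont else " " + word
        else:
            groups.append({"entity_group": label, "word": word})
    # Non-entity ("O") spans are not returned as entities.
    return [g for g in groups if g["entity_group"] != "O"]

preds = [("Die", "O"), ("Kletter", "PN"), ("##:in", "PN"), ("klettert", "O")]
print(aggregate_simple(preds))  # [{'entity_group': 'PN', 'word': 'Kletter:in'}]
```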
The model achieves the following results on the evaluation set:
- Precision: 0.9357
- Recall: 0.9399
- F1: 0.9378
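The reported F1 is the harmonic mean of precision and recall, which can be checked directly:

```python
# Verify that the reported F1 equals the harmonic mean of precision and recall.
precision, recall = 0.9357, 0.9399
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9378, matching the reported F1
```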
## Intended uses & limitations
This token classification model is intended for identifying personal nouns in German texts, including gender-neutral forms (e.g. "Kletter:in", "Kletter*innen").
## Training and evaluation data
The training dataset consisted of approximately 140,000 tokens. These were extracted from the German Reference Corpus (DeReKo; Kupietz et al. 2010, 2018), with texts from the dpa (Deutsche Presse-Agentur, 'German Press Agency') and the three magazines Brigitte, Zeit Wissen, and Psychologie Heute.
## Training procedure
We applied a pre-trained BERT tokenizer to sentences that had already been split into words. Model training followed the default hyperparameters recommended in the Hugging Face token classification tutorial. The corpus was split into training, validation, and test sets, with 80% used for training and 10% each for validation and testing. Model performance was evaluated at the token level on the held-out test set.
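The 80/10/10 corpus split can be sketched as follows (a minimal illustration; the helper name and random seed are assumptions, not the actual preprocessing code):

```python
# Illustrative 80/10/10 train/validation/test split of a sentence list,
# mirroring the corpus split described above.
import random

def split_corpus(sentences, seed=42):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    n_train = int(0.8 * len(data))
    n_val = int(0.1 * len(data))
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```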
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- weight_decay: 0.01
- num_epochs: 3
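These hyperparameters map onto the Hugging Face `TrainingArguments` API roughly as follows. Only `learning_rate`, `weight_decay`, and `num_train_epochs` come from this card; the remaining arguments are assumptions following the token classification tutorial's defaults:

```python
# Sketch of the training configuration, assuming the Hugging Face Trainer API.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="personal-noun-detector-german",  # assumed output path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    per_device_train_batch_size=16,  # tutorial default, not stated in this card
    evaluation_strategy="epoch",     # evaluate on the validation split each epoch
)
```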