# Personal Noun Detector for German
This model is a fine-tuned version of google-bert/bert-base-german-cased for detecting personal nouns in German (see "Training and evaluation data" below for the dataset).
During inference, we apply the aggregation strategy "simple" from the Hugging Face pipeline API (aggregation_strategy="simple"). This ensures that contiguous tokens sharing the same entity label are grouped and returned as a single entity, improving usability in downstream applications.
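To illustrate what the "simple" aggregation strategy does, here is a minimal sketch of the grouping logic in plain Python (the token/label data and the `PN`/`O` label names are invented for illustration; the actual implementation lives inside the Hugging Face pipeline):

```python
# Sketch of aggregation_strategy="simple": contiguous tokens with the same
# entity label are merged into one entity; "##" marks WordPiece continuations.
def aggregate_simple(token_preds):
    groups = []
    for tok, label in token_preds:
        cont = tok.startswith("##")
        word = tok[2:] if cont else tok
        if groups and groups[-1]["entity_group"] == label:
            # Same label as previous token: extend the current entity.
            groups[-1]["word"] += word if cont else " " + word
        else:
            groups.append({"entity_group": label, "word": word})
    # Non-entity ("O") spans are not returned as entities.
    return [g for g in groups if g["entity_group"] != "O"]

preds = [("Die", "O"), ("Kletter", "PN"), ("##:in", "PN"), ("klettert", "O")]
print(aggregate_simple(preds))  # [{'entity_group': 'PN', 'word': 'Kletter:in'}]
```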
The model achieves the following results on the evaluation set:
- Precision: 0.9357
- Recall: 0.9399
- F1: 0.9378
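The reported F1 is the harmonic mean of precision and recall, which can be checked directly:

```python
# Verify that the reported F1 equals the harmonic mean of precision and recall.
precision, recall = 0.9357, 0.9399
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9378, matching the reported F1
```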
## Intended uses & limitations
This token classification model is intended for identifying personal nouns in German texts, including gender-neutral forms (e.g. "Kletter:in", "Kletter*innen").
## Training and evaluation data
The training dataset consisted of approximately 140,000 tokens. These were extracted from the German Reference Corpus (DeReKo; Kupietz et al. 2010, 2018), with texts from the dpa (Deutsche Presse-Agentur, 'German Press Agency') and the three magazines Brigitte, Zeit Wissen, and Psychologie Heute.
## Training procedure
We applied a pre-trained BERT tokenizer to sentences that had already been split into words. Model training followed the default hyperparameters recommended in the Hugging Face token classification tutorial. The corpus was split into training, validation, and test sets, with 80% used for training and 10% each for validation and testing. Model performance was evaluated at the token level on the held-out test set.
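The 80/10/10 corpus split can be sketched as follows (a minimal illustration; the helper name and random seed are assumptions, not the actual preprocessing code):

```python
# Illustrative 80/10/10 train/validation/test split of a sentence list,
# mirroring the corpus split described above.
import random

def split_corpus(sentences, seed=42):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    rng = random.Random(seed)
    data = list(sentences)
    rng.shuffle(data)
    n_train = int(0.8 * len(data))
    n_val = int(0.1 * len(data))
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```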
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- weight_decay: 0.01
- num_epochs: 3
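These hyperparameters map onto the Hugging Face `TrainingArguments` API roughly as follows. Only `learning_rate`, `weight_decay`, and `num_train_epochs` come from this card; the remaining arguments are assumptions following the token classification tutorial's defaults:

```python
# Sketch of the training configuration, assuming the Hugging Face Trainer API.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="personal-noun-detector-german",  # assumed output path
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    per_device_train_batch_size=16,  # tutorial default, not stated in this card
    evaluation_strategy="epoch",     # evaluate on the validation split each epoch
)
```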