## POS Tagging: Token Segmentation and Categories

A simple script that extracts tokens and their POS categories using the Hugging Face `transformers` library.
```python
from transformers import pipeline

# Load the model and tokenizer
pos_pipeline = pipeline(
    "token-classification",
    model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger",
)

# Input text
text = (
    "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven "
    "robotics—successfully passed all 37 rigorous performance tests!"
)

# Split on spaces and tag each word; the pipeline returns one list of
# subword predictions per input string
words = text.split(" ")
tokens = pos_pipeline(words)

# Print each word alongside its subword tokens and their categories
for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
```
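Each entry in `tokens` above is a list of per-subword predictions for the corresponding word. The sketch below mocks that output shape to show how the printing loop pairs words with their subword groups; the tags and scores are illustrative placeholders, not actual model output:

```python
# Hypothetical per-word pipeline output (illustrative values only; real
# tags and scores come from the model)
sample = [
    [{"word": "▁On", "entity": "ADP", "score": 0.99}],
    [{"word": "▁proto", "entity": "NOUN", "score": 0.97},
     {"word": "type", "entity": "NOUN", "score": 0.96}],
]

# The same printing pattern as in the script above: one line per word,
# one column per subword prediction
for group in sample:
    for token in group:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print()
```

Note that a word like "prototype" can be split into several subwords ("▁proto", "type"), each carrying its own prediction.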
## POS Tagging with Stopword Extraction

This script performs Part-of-Speech (POS) tagging, reconstructs full words from subword tokens, assigns POS labels, and extracts two key word categories:

- Nouns and proper nouns (NOUN, PROPN): the content-bearing words in the text.
- Stopwords (DET, ADP, PRON, AUX, CCONJ, SCONJ, PART): articles, prepositions, pronouns, auxiliaries, conjunctions, etc.
```python
from transformers import pipeline

# Load the pre-trained POS tagging model
pos_pipeline = pipeline(
    "ner",
    model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger",
)

# Input text
text = (
    "Companies interested in providing the service must take care of "
    "signage and information boards."
)

# Run POS tagging
tokens = pos_pipeline(text)

# Print raw subword tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Reconstruct words from SentencePiece subwords ("▁" marks a word start)
words, buffer, labels = [], [], []
buffer_label = None
for token in tokens:
    raw_word = token["word"]
    if raw_word.startswith("▁"):  # A new word starts
        if buffer:
            words.append("".join(buffer))  # Flush the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)  # Continue building the current word

# Flush the last word left in the buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Define the POS tag sets used for extraction
noun_tags = {"NOUN", "PROPN"}  # Nouns and proper nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}  # Function-word tags

# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

# Print the extracted words
print("\nFiltered nouns and proper nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
```
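The word-reconstruction loop above can be wrapped into a small reusable helper. The sketch below applies the same "▁"-prefix logic to a mocked subword sequence; the tokens and tags are illustrative, not real model output:

```python
def reconstruct(tokens):
    """Merge SentencePiece subword pieces ('▁' marks a word start) into words."""
    words, labels, buffer, buffer_label = [], [], [], None
    for token in tokens:
        piece = token["word"]
        if piece.startswith("▁"):      # a new word begins
            if buffer:                 # flush the previous word first
                words.append("".join(buffer))
                labels.append(buffer_label)
            buffer = [piece[1:]]
            buffer_label = token["entity"]
        else:                          # continuation piece of the current word
            buffer.append(piece)
    if buffer:                         # flush the final word
        words.append("".join(buffer))
        labels.append(buffer_label)
    return words, labels

# Hypothetical subword output for "signage boards" (tags are illustrative)
mock = [
    {"word": "▁sign", "entity": "NOUN"},
    {"word": "age", "entity": "NOUN"},
    {"word": "▁boards", "entity": "NOUN"},
]
print(reconstruct(mock))  # → (['signage', 'boards'], ['NOUN', 'NOUN'])
```

The word's label is taken from its first subword piece; a plausible alternative would be majority voting across pieces when they disagree.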
## Multilingual POS Tagging

### Overview

This report outlines the evaluation framework and potential training configurations for a multilingual POS tagging model. The model is based on a Transformer architecture and is assessed after a limited number of training epochs.
### Expected Ranges

- Validation Loss: Typically between 0.02 and 0.1, depending on dataset complexity and regularization.
- Overall Precision: Expected to range from 96% to 99%, influenced by dataset diversity and tokenization quality.
- Overall Recall: Generally between 96% and 99%, subject to similar factors as precision.
- Overall F1-score: Expected range of 96% to 99%, balancing precision and recall.
- Overall Accuracy: Can vary between 97% and 99.5%, contingent on language variations and model robustness.
- Evaluation Speed: Typically 100-150 samples/sec (25-40 steps/sec), depending on batch size and hardware.
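For token-level POS tagging, where every token receives exactly one tag, micro-averaged precision, recall, and F1 all coincide with plain accuracy. The toy calculation below illustrates the overall accuracy computation; the gold and predicted tag sequences are made up for illustration:

```python
# Toy gold/predicted tag sequences (illustrative only)
gold = ["NOUN", "VERB", "DET", "NOUN", "ADP"]
pred = ["NOUN", "VERB", "DET", "PROPN", "ADP"]

# Overall (micro) accuracy: the fraction of tokens tagged correctly
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"Overall accuracy: {accuracy:.2f}")  # → Overall accuracy: 0.80
```

Per-tag (macro-averaged) precision and recall can still differ, which is why the table above reports them separately.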
### Training Configurations

- Model: Transformer-based architecture (e.g., BERT, RoBERTa, XLM-R)
- Training Epochs: 2 to 5, depending on convergence and validation performance.
- Batch Size: 1 to 16, balancing memory constraints and stability.
- Learning Rate: 1e-6 to 5e-4, adjusted based on optimization dynamics and warm-up strategies.
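The ranges above can be expressed as a Hugging Face `TrainingArguments` configuration. The concrete values below are mid-range picks chosen for illustration, not the settings actually used to train this model:

```python
from transformers import TrainingArguments

# Illustrative mid-range values drawn from the ranges listed above
training_args = TrainingArguments(
    output_dir="mdeberta-pos-tagger",  # hypothetical output path
    num_train_epochs=3,                # within the 2-5 epoch range
    per_device_train_batch_size=8,     # within the 1-16 batch-size range
    learning_rate=2e-5,                # within the 1e-6 to 5e-4 range
    warmup_ratio=0.1,                  # a common warm-up strategy
)
```

These arguments would then be passed to a `Trainer` together with the model, datasets, and a token-classification data collator.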
Base model: microsoft/mdeberta-v3-base