## POS Tagging: Token Segmentation and Categories

A simple script that extracts tokens and their POS categories using the Hugging Face `transformers` library.
```python
from transformers import pipeline

# Load the model and tokenizer
pos_pipeline = pipeline(
    "token-classification",
    model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger",
)

# Input text
text = (
    "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven "
    "robotics—successfully passed all 37 rigorous performance tests!"
)

# Split on spaces and tag each word; the pipeline returns one list of
# subword predictions per input string
words = text.split(" ")
tokens = pos_pipeline(words)

# Print each word alongside its subword tokens and their categories
for word, group_token in zip(words, tokens):
    print(f"{word:<15}", end=" ")
    for token in group_token:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print("\n" + "-" * 80)
```
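Each entry in `tokens` above is a list of per-subword predictions for the corresponding word. The sketch below mocks that output shape to show how the printing loop pairs words with their subword groups; the tags and scores are illustrative placeholders, not actual model output:

```python
# Hypothetical per-word pipeline output (illustrative values only; real
# tags and scores come from the model)
sample = [
    [{"word": "▁On", "entity": "ADP", "score": 0.99}],
    [{"word": "▁proto", "entity": "NOUN", "score": 0.97},
     {"word": "type", "entity": "NOUN", "score": 0.96}],
]

# The same printing pattern as in the script above: one line per word,
# one column per subword prediction
for group in sample:
    for token in group:
        print(f"{token['word']:<8} → {token['entity']:<8}", end=" | ")
    print()
```

Note that a word like "prototype" can be split into several subwords ("▁proto", "type"), each carrying its own prediction.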
## POS Tagging with Stopword Extraction

This script performs Part-of-Speech (POS) tagging, reconstructs full words from subword tokens, assigns POS labels, and extracts two key word categories:

- Nouns and proper nouns (NOUN, PROPN): the content-bearing words in the text.
- Stopwords (DET, ADP, PRON, AUX, CCONJ, SCONJ, PART): articles, prepositions, pronouns, auxiliaries, conjunctions, etc.
```python
from transformers import pipeline

# Load the pre-trained POS tagging model
pos_pipeline = pipeline(
    "ner",
    model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger",
)

# Input text
text = (
    "Companies interested in providing the service must take care of "
    "signage and information boards."
)

# Run POS tagging
tokens = pos_pipeline(text)

# Print raw subword tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")

# Reconstruct words from SentencePiece subwords ("▁" marks a word start)
words, buffer, labels = [], [], []
buffer_label = None
for token in tokens:
    raw_word = token["word"]
    if raw_word.startswith("▁"):  # A new word starts
        if buffer:
            words.append("".join(buffer))  # Flush the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)  # Continue building the current word

# Flush the last word left in the buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)

# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")

# Define the POS tag sets used for extraction
noun_tags = {"NOUN", "PROPN"}  # Nouns and proper nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"}  # Function-word tags

# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]

# Print the extracted words
print("\nFiltered nouns and proper nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
```
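The word-reconstruction loop above can be wrapped into a small reusable helper. The sketch below applies the same "▁"-prefix logic to a mocked subword sequence; the tokens and tags are illustrative, not real model output:

```python
def reconstruct(tokens):
    """Merge SentencePiece subword pieces ('▁' marks a word start) into words."""
    words, labels, buffer, buffer_label = [], [], [], None
    for token in tokens:
        piece = token["word"]
        if piece.startswith("▁"):      # a new word begins
            if buffer:                 # flush the previous word first
                words.append("".join(buffer))
                labels.append(buffer_label)
            buffer = [piece[1:]]
            buffer_label = token["entity"]
        else:                          # continuation piece of the current word
            buffer.append(piece)
    if buffer:                         # flush the final word
        words.append("".join(buffer))
        labels.append(buffer_label)
    return words, labels

# Hypothetical subword output for "signage boards" (tags are illustrative)
mock = [
    {"word": "▁sign", "entity": "NOUN"},
    {"word": "age", "entity": "NOUN"},
    {"word": "▁boards", "entity": "NOUN"},
]
print(reconstruct(mock))  # → (['signage', 'boards'], ['NOUN', 'NOUN'])
```

The word's label is taken from its first subword piece; a plausible alternative would be majority voting across pieces when they disagree.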
## Multilingual POS Tagging

### Overview

This report outlines the evaluation framework and potential training configurations for a multilingual POS tagging model. The model is based on a Transformer architecture and is assessed after a limited number of training epochs.
### Expected Ranges

- Validation Loss: Typically between 0.02 and 0.1, depending on dataset complexity and regularization.
- Overall Precision: Expected to range from 96% to 99%, influenced by dataset diversity and tokenization quality.
- Overall Recall: Generally between 96% and 99%, subject to similar factors as precision.
- Overall F1-score: Expected range of 96% to 99%, balancing precision and recall.
- Overall Accuracy: Can vary between 97% and 99.5%, contingent on language variations and model robustness.
- Evaluation Speed: Typically 100-150 samples/sec (25-40 steps/sec), depending on batch size and hardware.
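For token-level POS tagging, where every token receives exactly one tag, micro-averaged precision, recall, and F1 all coincide with plain accuracy. The toy calculation below illustrates the overall accuracy computation; the gold and predicted tag sequences are made up for illustration:

```python
# Toy gold/predicted tag sequences (illustrative only)
gold = ["NOUN", "VERB", "DET", "NOUN", "ADP"]
pred = ["NOUN", "VERB", "DET", "PROPN", "ADP"]

# Overall (micro) accuracy: the fraction of tokens tagged correctly
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"Overall accuracy: {accuracy:.2f}")  # → Overall accuracy: 0.80
```

Per-tag (macro-averaged) precision and recall can still differ, which is why the table above reports them separately.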
### Training Configurations

- Model: Transformer-based architecture (e.g., BERT, RoBERTa, XLM-R)
- Training Epochs: 2 to 5, depending on convergence and validation performance.
- Batch Size: 1 to 16, balancing memory constraints and stability.
- Learning Rate: 1e-6 to 5e-4, adjusted based on optimization dynamics and warm-up strategies.
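The ranges above can be expressed as a Hugging Face `TrainingArguments` configuration. The concrete values below are mid-range picks chosen for illustration, not the settings actually used to train this model:

```python
from transformers import TrainingArguments

# Illustrative mid-range values drawn from the ranges listed above
training_args = TrainingArguments(
    output_dir="mdeberta-pos-tagger",  # hypothetical output path
    num_train_epochs=3,                # within the 2-5 epoch range
    per_device_train_batch_size=8,     # within the 1-16 batch-size range
    learning_rate=2e-5,                # within the 1e-6 to 5e-4 range
    warmup_ratio=0.1,                  # a common warm-up strategy
)
```

These arguments would then be passed to a `Trainer` together with the model, datasets, and a token-classification data collator.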
Base model: microsoft/mdeberta-v3-base