UzUDT: Robust Uzbek Neural Dependency Parsing

This repository contains the trained Stanza-style neural models for Uzbek tokenization, morphosyntactic tagging, and dependency parsing, as described in the paper *Towards Robust Uzbek Neural Dependency Parsing*.

Model Description

The system is designed to handle the agglutinative morphology and resource scarcity of Uzbek. It utilizes a Stanza-like pipeline augmented with:

  1. BERTbek Contextual Embeddings: Utilizing the elmurod1202/bertbek-news-big-cased model with subword-to-word "super-token" fusion.
  2. Morphology-Aware Preprocessing: An improved Apertium-based normalization layer to reduce sparsity.
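
The paper's exact "super-token" fusion isn't reproduced in this card; a common way to obtain one vector per word from BERT-style subword embeddings is mean-pooling over each word's subword pieces. A minimal sketch, assuming a `word_ids`-style alignment (as produced by Hugging Face fast tokenizers) mapping each subword to its word index:

```python
import numpy as np

def fuse_subwords(subword_vecs: np.ndarray, word_ids: list) -> np.ndarray:
    """Mean-pool subword embeddings into one "super-token" vector per word.

    subword_vecs: (num_subwords, dim) array of contextual embeddings.
    word_ids: for each subword, the index of the word it belongs to
              (None for special tokens like [CLS]/[SEP]).
    """
    num_words = max(i for i in word_ids if i is not None) + 1
    fused = np.zeros((num_words, subword_vecs.shape[1]))
    counts = np.zeros(num_words)
    for vec, wid in zip(subword_vecs, word_ids):
        if wid is None:
            continue  # skip special tokens
        fused[wid] += vec
        counts[wid] += 1
    return fused / counts[:, None]

# e.g. one agglutinated word split into 3 subwords, all mapping to word 0
vecs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(fuse_subwords(vecs, [0, 0, 0]))  # [[3. 4.]]
```

Mean-pooling is only one choice; first-subword selection or weighted pooling are equally plausible fusion strategies.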

Performance (UzUDT Test Set)

Evaluated on the test split of the 3-star UzUDT treebank (681 sentences).

| Metric | Score (%) |
|--------|-----------|
| UPOS   | 86.10 |
| XPOS   | 83.96 |
| UAS    | 74.21 |
| LAS    | 66.90 |
| UFeats | 70.06 |
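
For reference, UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency label. A minimal sketch of the two metrics (not the official CoNLL evaluation script, which also handles tokenization mismatches):

```python
def attachment_scores(gold, pred):
    """Compute UAS/LAS over (head, deprel) pairs, one per token.

    gold, pred: lists of (head_index, deprel) tuples, aligned token-by-token
                (head 0 = root). Returns (uas, las) as percentages.
    """
    assert len(gold) == len(pred) and gold
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n

# Hypothetical 4-token sentence: one wrong label, one wrong head
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
print(attachment_scores(gold, pred))  # (75.0, 50.0)
```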

Usage

To use these models, download the `.pt` files to your local directory. You must specify the path to each model component (tokenize, pos, depparse) in the pipeline configuration.

```python
import stanza

# Configuration pointing to the local .pt files
config = {
    'tokenize_model_path': './uz_uzudt_tokenizer.pt',
    'pos_model_path': './uz_uzudt-base_tagger.pt',
    'depparse_model_path': './uz_uzudt_nocharlm_parser.pt',
    'use_gpu': True
}

# Initialize the pipeline
# Note: 'lemma' is excluded as it requires a separate model or external Apertium integration
nlp = stanza.Pipeline(lang='uz', processors='tokenize,pos,depparse', **config)

doc = nlp("Oʻzbekistonning poytaxti Toshkent shahridir.")
doc.sentences[0].print_dependencies()
```
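
If you need the parse in CoNLL-U format for downstream tools, Stanza provides conversion utilities; for clarity, here is a minimal standalone formatter over (id, text, upos, head, deprel) tuples, with the columns this pipeline doesn't fill left as `_`. The parse shown is illustrative, not the model's actual output:

```python
def to_conllu(tokens):
    """Render parsed tokens as 10-column CoNLL-U lines.

    tokens: list of dicts with keys id, text, upos, head, deprel
            (head 0 = root). Unfilled columns are emitted as '_'.
    """
    lines = []
    for t in tokens:
        # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([
            str(t["id"]), t["text"], "_", t["upos"], "_",
            "_", str(t["head"]), t["deprel"], "_", "_",
        ]))
    return "\n".join(lines) + "\n"

# Illustrative (hand-written) analysis of the example sentence
parsed = [
    {"id": 1, "text": "Oʻzbekistonning", "upos": "PROPN", "head": 2, "deprel": "nmod"},
    {"id": 2, "text": "poytaxti", "upos": "NOUN", "head": 4, "deprel": "nsubj"},
    {"id": 3, "text": "Toshkent", "upos": "PROPN", "head": 4, "deprel": "compound"},
    {"id": 4, "text": "shahridir", "upos": "NOUN", "head": 0, "deprel": "root"},
]
print(to_conllu(parsed), end="")
```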