UzUDT: Robust Uzbek Neural Dependency Parsing

This repository contains the trained Stanza-style neural models for Uzbek tokenization, morphosyntactic tagging, and dependency parsing, as described in the paper *Towards Robust Uzbek Neural Dependency Parsing*.

Model Description

The system is designed to handle the agglutinative morphology and resource scarcity of Uzbek. It utilizes a Stanza-like pipeline augmented with:

  1. BERTbek Contextual Embeddings: Utilizing the elmurod1202/bertbek-news-big-cased model with subword-to-word "super-token" fusion.
  2. Morphology-Aware Preprocessing: An improved Apertium-based normalization layer to reduce sparsity.
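
The paper's exact "super-token" fusion isn't reproduced in this card; a common way to obtain one vector per word from BERT-style subword embeddings is mean-pooling over each word's subword pieces. A minimal sketch, assuming a `word_ids`-style alignment (as produced by Hugging Face fast tokenizers) mapping each subword to its word index:

```python
import numpy as np

def fuse_subwords(subword_vecs: np.ndarray, word_ids: list) -> np.ndarray:
    """Mean-pool subword embeddings into one "super-token" vector per word.

    subword_vecs: (num_subwords, dim) array of contextual embeddings.
    word_ids: for each subword, the index of the word it belongs to
              (None for special tokens like [CLS]/[SEP]).
    """
    num_words = max(i for i in word_ids if i is not None) + 1
    fused = np.zeros((num_words, subword_vecs.shape[1]))
    counts = np.zeros(num_words)
    for vec, wid in zip(subword_vecs, word_ids):
        if wid is None:
            continue  # skip special tokens
        fused[wid] += vec
        counts[wid] += 1
    return fused / counts[:, None]

# e.g. one agglutinated word split into 3 subwords, all mapping to word 0
vecs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(fuse_subwords(vecs, [0, 0, 0]))  # [[3. 4.]]
```

Mean-pooling is only one choice; first-subword selection or weighted pooling are equally plausible fusion strategies.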

Performance (UzUDT Test Set)

Evaluated on the test split of the 3-star UzUDT treebank (681 sentences).

| Metric | Score (%) |
|--------|-----------|
| UPOS   | 86.10 |
| XPOS   | 83.96 |
| UAS    | 74.21 |
| LAS    | 66.90 |
| UFeats | 70.06 |
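
For reference, UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency label. A minimal sketch of the two metrics (not the official CoNLL evaluation script, which also handles tokenization mismatches):

```python
def attachment_scores(gold, pred):
    """Compute UAS/LAS over (head, deprel) pairs, one per token.

    gold, pred: lists of (head_index, deprel) tuples, aligned token-by-token
                (head 0 = root). Returns (uas, las) as percentages.
    """
    assert len(gold) == len(pred) and gold
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n

# Hypothetical 4-token sentence: one wrong label, one wrong head
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
print(attachment_scores(gold, pred))  # (75.0, 50.0)
```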

Usage

To use these models, download the `.pt` files to your local directory. You must specify the path to each model component (tokenize, pos, depparse) in the pipeline configuration.

```python
import stanza

# Configuration pointing to the local .pt files
config = {
    'tokenize_model_path': './uz_uzudt_tokenizer.pt',
    'pos_model_path': './uz_uzudt-base_tagger.pt',
    'depparse_model_path': './uz_uzudt_nocharlm_parser.pt',
    'use_gpu': True
}

# Initialize the pipeline
# Note: 'lemma' is excluded as it requires a separate model or external Apertium integration
nlp = stanza.Pipeline(lang='uz', processors='tokenize,pos,depparse', **config)

doc = nlp("Oʻzbekistonning poytaxti Toshkent shahridir.")
doc.sentences[0].print_dependencies()
```
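
If you need the parse in CoNLL-U format for downstream tools, Stanza provides conversion utilities; for clarity, here is a minimal standalone formatter over (id, text, upos, head, deprel) tuples, with the columns this pipeline doesn't fill left as `_`. The parse shown is illustrative, not the model's actual output:

```python
def to_conllu(tokens):
    """Render parsed tokens as 10-column CoNLL-U lines.

    tokens: list of dicts with keys id, text, upos, head, deprel
            (head 0 = root). Unfilled columns are emitted as '_'.
    """
    lines = []
    for t in tokens:
        # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([
            str(t["id"]), t["text"], "_", t["upos"], "_",
            "_", str(t["head"]), t["deprel"], "_", "_",
        ]))
    return "\n".join(lines) + "\n"

# Illustrative (hand-written) analysis of the example sentence
parsed = [
    {"id": 1, "text": "Oʻzbekistonning", "upos": "PROPN", "head": 2, "deprel": "nmod"},
    {"id": 2, "text": "poytaxti", "upos": "NOUN", "head": 4, "deprel": "nsubj"},
    {"id": 3, "text": "Toshkent", "upos": "PROPN", "head": 4, "deprel": "compound"},
    {"id": 4, "text": "shahridir", "upos": "NOUN", "head": 0, "deprel": "root"},
]
print(to_conllu(parsed), end="")
```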