# UzUDT: Robust Uzbek Neural Dependency Parsing
This repository contains the trained Stanza-style neural models for Uzbek tokenization, morphosyntactic tagging, and dependency parsing, as described in the paper *Towards Robust Uzbek Neural Dependency Parsing*.
## Model Description
The system is designed to handle the agglutinative morphology and resource scarcity of Uzbek. It uses a Stanza-like pipeline augmented with:

- BERTbek Contextual Embeddings: the `elmurod1202/bertbek-news-big-cased` model with subword-to-word "super-token" fusion (illustrated in the sketch after this list).
- Morphology-Aware Preprocessing: an improved Apertium-based normalization layer to reduce sparsity.
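As a rough illustration of the "super-token" fusion step, the sketch below mean-pools BERTbek subword vectors back to the word level using the Hugging Face `transformers` API. The mean-pooling strategy is an assumption for illustration; the paper's exact fusion method may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch: fuse BERTbek subword vectors into word-level
# "super-tokens" by mean pooling (the paper's fusion strategy may differ).
tokenizer = AutoTokenizer.from_pretrained("elmurod1202/bertbek-news-big-cased")
model = AutoModel.from_pretrained("elmurod1202/bertbek-news-big-cased")

words = ["Oʻzbekistonning", "poytaxti", "Toshkent", "shahridir", "."]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (num_subwords, dim)

# Map each original word to its subword positions, then average.
word_ids = encoding.word_ids()  # None marks special tokens like [CLS]/[SEP]
word_vectors = [
    hidden[[i for i, wid in enumerate(word_ids) if wid == idx]].mean(dim=0)
    for idx in range(len(words))
]
print(len(word_vectors), word_vectors[0].shape)  # 5 word-level vectors
```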
## Performance (UzUDT Test Set)
Evaluated on the 3-star UzUDT treebank (681 sentences).
| Metric | Score (%) |
|---|---|
| UPOS | 86.10 |
| XPOS | 83.96 |
| UAS | 74.21 |
| LAS | 66.90 |
| UFeats | 70.06 |
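For reference, UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct dependency relation. The sketch below computes both from gold and predicted CoNLL-U files; the file names are placeholders, and it assumes identical tokenization between the two files (the official CoNLL 2018 `conll18_ud_eval.py` script handles alignment more robustly).

```python
def read_deps(path):
    """Collect (HEAD, DEPREL) pairs for each syntactic word in a CoNLL-U file."""
    deps = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip MWT ranges and empty nodes
                continue
            deps.append((cols[6], cols[7]))
    return deps

gold, pred = read_deps("gold.conllu"), read_deps("pred.conllu")
uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"UAS: {uas:.2%}  LAS: {las:.2%}")
```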
## Usage
To use these models, download the `.pt` files to a local directory and point the pipeline configuration at each model component (tokenizer, POS tagger, dependency parser):
```python
import stanza

# Configuration pointing to the local .pt files
config = {
    'tokenize_model_path': './uz_uzudt_tokenizer.pt',
    'pos_model_path': './uz_uzudt-base_tagger.pt',
    'depparse_model_path': './uz_uzudt_nocharlm_parser.pt',
    'use_gpu': True,
}

# Initialize the pipeline.
# 'lemma' is excluded: it requires a separate model or external Apertium integration.
# download_method=None skips fetching default resources (available in recent Stanza releases).
nlp = stanza.Pipeline(lang='uz', processors='tokenize,pos,depparse',
                      download_method=None, **config)

doc = nlp("Oʻzbekistonning poytaxti Toshkent shahridir.")
doc.sentences[0].print_dependencies()
```
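Beyond `print_dependencies()`, the standard Stanza document API exposes per-word annotations directly:

```python
# Inspect the parse word by word: UPOS tag, head index, and relation label.
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.id, word.text, word.upos, word.head, word.deprel)
```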