Model Card for PyMUSAS Neural Multilingual Base BEM
A fine-tuned 307 million (307M) parameter semantic tagger based on a multilingual ModernBERT architecture. The tagger outputs semantic tags at the token level from the USAS tagset.
The semantic tagger is a variation of the Bi-Encoder Model (BEM) from Blevins and Zettlemoyer (2020), a Word Sense Disambiguation (WSD) model.
Quick start
Installation
Requires Python 3.10 or greater. It is best to install the version of PyTorch you would like to use (e.g. a CPU or GPU build) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. The package requires `torch>=2.2,<3.0`.
```bash
pip install wsd-torch-models
```
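For example, to use a CPU-only PyTorch build, install it from the official PyTorch wheel index before installing this package (a sketch; pick the index URL that matches your hardware from pytorch.org):

```bash
# CPU-only PyTorch build; swap the index URL for a CUDA build if needed
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```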
Usage
```python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name)
    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cpu" or "cuda"
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when None the appropriate tokenizer is
        # downloaded automatically. It is generally better to pass the tokenizer in,
        # as this saves checking whether the tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ":" + tag_definition)
        print()
```
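If you need to tag more than one sentence, the same `predict` call can be applied per sentence. A minimal sketch reusing the model, tokenizer, and `predict` signature from the example above (the helper function below is purely illustrative and not part of the package):

```python
import torch


def tag_sentences(sentences, wsd_model, tokenizer, top_n=1):
    """Return, for each sentence, a list of (token, ranked tags) pairs.

    Illustrative helper only: it loops over sentences and calls
    wsd_model.predict on whitespace-split tokens, as in the example above.
    """
    tagged = []
    with torch.inference_mode(mode=True):
        for sentence in sentences:
            tokens = sentence.split()
            predictions = wsd_model.predict(tokens, sub_word_tokenizer=tokenizer, top_n=top_n)
            tagged.append(list(zip(tokens, predictions)))
    return tagged
```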
Model Description
For more details about the model and how it was trained, please see the citation/technical report, as well as the links in the Model Sources section.
Model Sources
The training repository contains the code used to train this model. The inference repository contains the code used to run the model as shown in the usage section.
- Training Repository: https://github.com/UCREL/experimental-wsd
- Inference/Usage Repository: https://github.com/UCREL/WSD-Torch-Models
Model Architecture
| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|---|---|---|---|---|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
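As a rough sanity check on the parameter counts above, you can count the parameters of the loaded model directly. A sketch using the `BEM` class from the usage example (note that the printed total reflects everything in the loaded module, so it may differ slightly from the base encoder figure in the table):

```python
from wsd_torch_models.bem import BEM

model = BEM.from_pretrained("ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM")
# Sum over all trainable and non-trainable parameters in the loaded model
total_parameters = sum(parameter.numel() for parameter in model.parameters())
print(f"Total parameters: {total_parameters / 1e6:.0f}M")
```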
Training Data
The model has been trained on a portion of the ucrelnlp/English-USAS-Mosaico dataset, specifically data/wikipedia_shard_0.jsonl.gz, which contains 1,083 English Wikipedia articles comprising 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
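To inspect this training shard yourself, one option is to load it from the Hub with the `datasets` library. A sketch (the exact field names in the JSONL records are not documented here, so print the features to see them):

```python
from datasets import load_dataset

# Load only the shard used for training from the Hub dataset repository
shard = load_dataset(
    "ucrelnlp/English-USAS-Mosaico",
    data_files="data/wikipedia_shard_0.jsonl.gz",
    split="train",
)
print(shard)           # number of rows in the shard
print(shard.features)  # the fields available in each record
```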
Evaluation
We have evaluated the models on 5 datasets covering 5 different languages; 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. The top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison, please see the technical report.
| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|---|---|---|---|---|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
The publicly available datasets can be found on the Hugging Face Hub at ucrelnlp/USAS-WSD.
Note that the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on it.
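For reference, top-n accuracy here counts a token as correct if its gold tag appears among the model's n highest-ranked tags. A minimal sketch of that computation (the gold-tag alignment and names below are illustrative, not taken from the evaluation code):

```python
def top_n_accuracy(predicted_tags_per_token, gold_tags, n=5):
    """Fraction of tokens whose gold tag is within the top-n predicted tags.

    predicted_tags_per_token: one ranked list of tags per token
    gold_tags: one gold tag per token, aligned with the predictions
    """
    assert len(predicted_tags_per_token) == len(gold_tags)
    correct = sum(
        gold in predicted[:n]
        for predicted, gold in zip(predicted_tags_per_token, gold_tags)
    )
    return correct / len(gold_tags)
```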
Citation
A technical report is forthcoming.
Contact Information
- Paul Rayson ([email protected])
- Andrew Moore ([email protected] / [email protected])
- UCREL Research Centre ([email protected]) at Lancaster University.
Base model
- jhu-clsp/mmBERT-base