Model Card for PyMUSAS Neural Multilingual Base BEM
A fine-tuned 307 million (307M) parameter semantic tagger based on a multilingual ModernBERT architecture. The tagger outputs semantic tags at the token level from the USAS tagset.
The semantic tagger is a variation of the Bi-Encoder Model (BEM) from Blevins and Zettlemoyer (2020), a Word Sense Disambiguation (WSD) model.
Quick start
Installation
Requires Python 3.10 or greater. It is best to install the version of PyTorch you would like to use (e.g. a CPU or GPU build) before installing this package; otherwise you will get the default PyTorch build for your operating system/setup. The package requires `torch>=2.2,<3.0`.
```bash
pip install wsd-torch-models
```
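For example, to use a CPU-only PyTorch build, install it from the official PyTorch wheel index before installing this package (a sketch; pick the index URL that matches your hardware from pytorch.org):

```bash
# CPU-only PyTorch build; swap the index URL for a CUDA build if needed
pip install "torch>=2.2,<3.0" --index-url https://download.pytorch.org/whl/cpu
pip install wsd-torch-models
```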
Usage
```python
from transformers import AutoTokenizer
import torch

from wsd_torch_models.bem import BEM

if __name__ == "__main__":
    wsd_model_name = "ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM"
    wsd_model = BEM.from_pretrained(wsd_model_name)
    tokenizer = AutoTokenizer.from_pretrained(wsd_model_name)
    wsd_model.eval()
    # Change this to the device you would like to use, e.g. "cpu" or "cuda"
    model_device = "cpu"
    wsd_model.to(device=model_device)

    sentence = "The river bank was full of fish"
    sentence_tokens = sentence.split()
    with torch.inference_mode(mode=True):
        # sub_word_tokenizer can be None; when None the appropriate tokenizer is
        # downloaded automatically. It is generally better to pass the tokenizer in,
        # as this saves checking whether the tokenizer has already been downloaded.
        predictions = wsd_model.predict(sentence_tokens, sub_word_tokenizer=tokenizer, top_n=5)

    for sentence_token, semantic_tags in zip(sentence_tokens, predictions):
        print("Token: " + sentence_token)
        print("Most likely tags: ")
        for tag in semantic_tags:
            tag_definition = wsd_model.label_to_definition[tag]
            print("\t" + tag + ":" + tag_definition)
        print()
```
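If you need to tag more than one sentence, the same `predict` call can be applied per sentence. A minimal sketch reusing the model, tokenizer, and `predict` signature from the example above (the helper function below is purely illustrative and not part of the package):

```python
import torch


def tag_sentences(sentences, wsd_model, tokenizer, top_n=1):
    """Return, for each sentence, a list of (token, ranked tags) pairs.

    Illustrative helper only: it loops over sentences and calls
    wsd_model.predict on whitespace-split tokens, as in the example above.
    """
    tagged = []
    with torch.inference_mode(mode=True):
        for sentence in sentences:
            tokens = sentence.split()
            predictions = wsd_model.predict(tokens, sub_word_tokenizer=tokenizer, top_n=top_n)
            tagged.append(list(zip(tokens, predictions)))
    return tagged
```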
Model Description
For more details about the model and how it was trained, please see the citation/technical report, as well as the links in the Model Sources section.
Model Sources
The training repository contains the code used to train this model. The inference repository contains the code used to run the model as shown in the usage section.
- Training Repository: https://github.com/UCREL/experimental-wsd
- Inference/Usage Repository: https://github.com/UCREL/WSD-Torch-Models
Model Architecture
| Parameter | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|---|---|---|---|---|
| Layers | 7 | 19 | 22 | 22 |
| Hidden Size | 256 | 512 | 384 | 768 |
| Intermediate Size | 384 | 768 | 1152 | 1152 |
| Attention Heads | 4 | 8 | 6 | 12 |
| Total Parameters | 17M | 68M | 140M | 307M |
| Non-embedding Parameters | 3.9M | 42.4M | 42M | 110M |
| Max Sequence Length | 8,000 | 8,000 | 8,192 | 8,192 |
| Vocabulary Size | 50,368 | 50,368 | 256,000 | 256,000 |
| Tokenizer | ModernBERT | ModernBERT | Gemma 2 | Gemma 2 |
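As a rough sanity check on the parameter counts above, you can count the parameters of the loaded model directly. A sketch using the `BEM` class from the usage example (note that the printed total reflects everything in the loaded module, so it may differ slightly from the base encoder figure in the table):

```python
from wsd_torch_models.bem import BEM

model = BEM.from_pretrained("ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM")
# Sum over all trainable and non-trainable parameters in the loaded model
total_parameters = sum(parameter.numel() for parameter in model.parameters())
print(f"Total parameters: {total_parameters / 1e6:.0f}M")
```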
Training Data
The model has been trained on a portion of the ucrelnlp/English-USAS-Mosaico dataset, specifically data/wikipedia_shard_0.jsonl.gz, which contains 1,083 English Wikipedia articles comprising 444,880 sentences and 6.6 million tokens, of which 5.3 million tokens carry silver labels generated by an English rule-based semantic tagger.
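To inspect this training shard yourself, one option is to load it from the Hub with the `datasets` library. A sketch (the exact field names in the JSONL records are not documented here, so print the features to see them):

```python
from datasets import load_dataset

# Load only the shard used for training from the Hub dataset repository
shard = load_dataset(
    "ucrelnlp/English-USAS-Mosaico",
    data_files="data/wikipedia_shard_0.jsonl.gz",
    split="train",
)
print(shard)           # number of rows in the shard
print(shard.features)  # the fields available in each record
```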
Evaluation
We have evaluated the models on 5 datasets covering 5 different languages; 4 of these datasets are publicly available, whereas one (the Irish data) requires permission from the data owner to access. The top 1 and top 5 accuracy results for these models are shown below; for a more comprehensive comparison, please see the technical report.
| Dataset | 17M English | 68M English | 140M Multilingual | 307M Multilingual |
|---|---|---|---|---|
| **Top 1** | | | | |
| Chinese | - | - | 42.2 | 47.9 |
| English | 66.4 | 70.1 | 66.0 | 70.2 |
| Finnish | - | - | 15.8 | 25.9 |
| Irish | - | - | 28.5 | 35.6 |
| Welsh | - | - | 21.7 | 42.0 |
| **Top 5** | | | | |
| Chinese | - | - | 66.3 | 70.4 |
| English | 87.6 | 90.0 | 88.9 | 90.1 |
| Finnish | - | - | 32.8 | 42.4 |
| Irish | - | - | 47.6 | 51.6 |
| Welsh | - | - | 40.8 | 56.4 |
The publicly available datasets can be found on the Hugging Face Hub at ucrelnlp/USAS-WSD.
Note that the English models have not been evaluated on the non-English datasets, as they are unlikely to represent non-English text well or perform well on it.
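For reference, top-n accuracy here counts a token as correct if its gold tag appears among the model's n highest-ranked tags. A minimal sketch of that computation (the gold-tag alignment and names below are illustrative, not taken from the evaluation code):

```python
def top_n_accuracy(predicted_tags_per_token, gold_tags, n=5):
    """Fraction of tokens whose gold tag is within the top-n predicted tags.

    predicted_tags_per_token: one ranked list of tags per token
    gold_tags: one gold tag per token, aligned with the predictions
    """
    assert len(predicted_tags_per_token) == len(gold_tags)
    correct = sum(
        gold in predicted[:n]
        for predicted, gold in zip(predicted_tags_per_token, gold_tags)
    )
    return correct / len(gold_tags)
```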
Citation
A technical report is forthcoming.
Contact Information
- Paul Rayson ([email protected])
- Andrew Moore ([email protected] / [email protected])
- UCREL Research Centre ([email protected]) at Lancaster University.
Base model
- jhu-clsp/mmBERT-base