m2v-multilingual-european

The minishlab/M2V_multilingual_output model (distilled from LaBSE), pruned to European languages only.

What is this?

This is the minishlab/M2V_multilingual_output model with every token written in a non-European script removed from its vocabulary. The base model was distilled from LaBSE (Language-agnostic BERT Sentence Embedding, 470M params) by the MinishLab team. Pruning keeps only European-script tokens and leaves the embedding dimension at 256.

Stats

	Before pruning	After pruning
Vocabulary	501,054 tokens	357,416 tokens
Model size	~490 MB	~350 MB
Embedding dim	256	256

Pruning removed 143,638 tokens (28.7% of the original vocabulary), all from non-European scripts.
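The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes that the model is essentially one embedding matrix of shape (vocabulary size, 256) stored as 32-bit floats, and that the "MB" values in the table are binary megabytes (MiB). Both are assumptions, not statements from the authors. The function name `pruning_stats` is hypothetical.

```python
def pruning_stats(vocab_before: int, vocab_after: int, dim: int) -> dict:
    """Derive removal share, parameter count and approximate size from the table."""
    removed = vocab_before - vocab_after
    params_after = vocab_after * dim      # one row of `dim` floats per token
    bytes_after = params_after * 4        # assumption: 4 bytes per F32 value
    return {
        "removed_tokens": removed,
        "removed_percent": round(100 * removed / vocab_before, 1),
        "params_after": params_after,
        "size_mib_after": round(bytes_after / 2**20),
    }


stats = pruning_stats(vocab_before=501_054, vocab_after=357_416, dim=256)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under these assumptions the parameter count comes out at about 91.5M, which agrees with the Safetensors figure reported for this model.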

Usage

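from model2vec import StaticModel

# Load the pruned static embedding model from the Hugging Face Hub.
model = StaticModel.from_pretrained("flipbitsnotburgers/m2v-multilingual-european")
# Encode a list of strings; each input yields one 256-dimensional vector.
embeddings = model.encode(["deodorant", "Duschgel", "shower gel"])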
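A common next step is to compare the resulting vectors. The following is a minimal sketch using plain-Python cosine similarity; `cosine_similarity` and `compare_products` are hypothetical helper names, not part of model2vec, and no particular scores are claimed for the example words.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_products() -> None:
    """Print pairwise similarities for the example words (downloads the model)."""
    from model2vec import StaticModel  # imported lazily so the helper above has no dependencies

    model = StaticModel.from_pretrained("flipbitsnotburgers/m2v-multilingual-european")
    words = ["deodorant", "Duschgel", "shower gel"]
    embeddings = model.encode(words)
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            score = cosine_similarity(embeddings[i], embeddings[j])
            print(f"{words[i]!r} vs {words[j]!r}: {score:.3f}")
```

Calling `compare_products()` requires model2vec to be installed and network access to the Hub. For large batches a vectorized NumPy version would be faster than this loop.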

Pruned scripts

The following scripts were removed:

CJK (Chinese, Japanese Kanji)
Hangul (Korean)
Hiragana & Katakana (Japanese)
Arabic
Hebrew
Thai, Lao
Devanagari, Bengali, Tamil, Telugu, and other Indic scripts
Myanmar, Ethiopic, Tibetan, Khmer
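One way to decide which tokens belong to the scripts listed above is to test each character against Unicode block ranges. The sketch below is an illustration under that assumption, not the authors' actual pruning code. The names `PRUNED_RANGES`, `uses_pruned_script` and `keep_indices` are hypothetical, and the range list covers only the main blocks of each script.

```python
from typing import Iterable, List, Tuple

# Assumed Unicode block ranges for the scripts named in the list above.
PRUNED_RANGES: List[Tuple[int, int]] = [
    (0x0590, 0x05FF),  # Hebrew
    (0x0600, 0x06FF),  # Arabic
    (0x0900, 0x0DFF),  # Devanagari through Sinhala (Indic scripts)
    (0x0E00, 0x0E7F),  # Thai
    (0x0E80, 0x0EFF),  # Lao
    (0x0F00, 0x0FFF),  # Tibetan
    (0x1000, 0x109F),  # Myanmar
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x1200, 0x137F),  # Ethiopic
    (0x1780, 0x17FF),  # Khmer
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def uses_pruned_script(token: str) -> bool:
    """True if any character of the token falls inside a pruned range."""
    return any(lo <= ord(ch) <= hi for ch in token for lo, hi in PRUNED_RANGES)


def keep_indices(vocab: Iterable[str]) -> List[int]:
    """Indices of tokens to keep; the same indices select embedding rows."""
    return [i for i, tok in enumerate(vocab) if not uses_pruned_script(tok)]


vocab = ["Duschgel", "##gel", "東京", "한국", "שלום", "Привет", "ไทย", "हिन्दी"]
print([vocab[i] for i in keep_indices(vocab)])
```

Scripts that are not on the removal list, such as Cyrillic, pass through this filter unchanged. The kept indices can then be used to slice both the tokenizer vocabulary and the embedding matrix so the two stay aligned.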

License

MIT (same as base model)

Safetensors

Model size: 91.5M params
Tensor type: F32

Model tree

sentence-transformers/LaBSE (base model) → minishlab/M2V_multilingual_output → flipbitsnotburgers/m2v-multilingual-european (this model)