---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
library_name: sentence-transformers
tags:
- kazakh
- sentence-transformers
- transformers
- multilingual
- sentence-similarity
- feature-extraction
---

# sultanbi/e5-base-kazakh

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) on a translated [Kazakh version of the SNLI dataset](https://huggingface.co/datasets/sultanbi/snli-kazakh). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

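
To make the "dense vector space" idea concrete, here is a minimal sketch of how semantic search over embeddings can work: rank corpus vectors by cosine similarity to a query vector. The helper names (`l2_normalize`, `cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are illustrative assumptions, not part of this model or of the Sentence Transformers API; in practice the vectors would be the 768-dimensional outputs of the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors most similar to the query."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional stand-ins for real sentence embeddings (hypothetical values).
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]

ranked = top_k(query, corpus, k=2)
for index, score in ranked:
    print(f"corpus[{index}] score={score:.3f}")
```

Because the vectors are normalized before the dot product, the score depends only on direction, not on vector length, which is why cosine similarity is a common choice for comparing sentence embeddings.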
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sultanbi/e5-base-kazakh")

# Run inference
sentences = [
    'query: Ақ көйлек киген аққұба әйел гүлді ағашта ақ құсты ұстап отыр.',
    'query: Балалар күлімсіреп, камераға қол бұлғап тұр.',
    'query: Үш бала көл жағасында тұр.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Evaluation and comparison against multilingual-e5-base on Kazakh tasks

| Task / Dataset | Metric | multilingual-e5-base | e5-base-kazakh | Δ (absolute) | Description |
| ------------------------- | :----------------: | :------------------: | :---------------: | :-----------: | ---------------------------------------------------------- |
| Kazakh SNLI (validation set) retrieval | R@1 | 0.433 | 0.887 | +0.454 | Monolingual semantic retrieval on translated SNLI (validation set) |
| | R@5 | 0.619 | 0.984 | +0.365 | |
| | R@10 | 0.710 | 0.994 | +0.284 | |
| STS-B (val, Kazakh) | Pearson | 0.708 | 0.817 | +0.109 | Semantic similarity correlation (machine-translated STS-B) |
| | Spearman | 0.721 | 0.825 | +0.104 | |
| Triplet Accuracy | — | 0.802 | 0.941 | +0.139 | Monolingual entailment discrimination |

### Framework Versions

- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.6.0+cu124
- Accelerate: 1.8.1
- Datasets: 3.6.0
- Tokenizers: 0.21.2

## Citation

### BibTeX

```bibtex
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```