Model Card for Hinvec Embedding Model

This model follows a decoder-only transformer architecture and has been fine-tuned for a range of embedding-centric tasks: text classification, clustering, semantic search (retrieval), bitext mining, and semantic textual similarity (STS). Fine-tuning is guided by datasets in Hindi (Devanagari script), Hindi (Romanized script), and English, enabling the model to produce robust multilingual embeddings. By leveraging task-specific contrastive and supervised signals during training, the model learns discriminative, semantically meaningful vector representations across languages and scripts. It is particularly suited to applications that require high-quality sentence or document representations, especially in multilingual or cross-lingual settings involving Indic languages.

Model Details

Model Description

This model is instruction-tuned using a contrastive learning objective, enabling it to generate high-quality, semantically meaningful embeddings aligned with instruction-based tasks. During training, a contrastive loss function is employed to encourage the model to bring semantically similar inputs closer in the embedding space while pushing dissimilar pairs apart.
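As a point of reference, a symmetric in-batch contrastive (InfoNCE-style) loss of the kind commonly used for embedding fine-tuning is sketched below. The exact loss, temperature, and negative-sampling scheme used for this model are not documented here, so treat this as an illustrative assumption rather than the actual training code.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    # query_emb, pos_emb: (batch, dim) embeddings of paired inputs;
    # every other item in the batch serves as an in-batch negative.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)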

To enhance embedding quality, we use a bidirectional attention mask, allowing every token to attend to every other token in the sequence. This is particularly effective at capturing rich contextual relationships within the input, improving the representation of both local and global semantics.
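For intuition, the sketch below contrasts the causal mask of a standard decoder with the full (bidirectional) attention pattern described above; it illustrates the masking idea only and is not the model's internal implementation.

import torch

seq_len = 5
# Standard decoder: token i attends only to tokens j <= i (lower triangle).
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Bidirectional: every token attends to every other token, in both directions.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)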

After processing the input through the transformer layers, we apply mean pooling over the token embeddings (excluding padding tokens, if applicable) to derive a single fixed-size sentence embedding. This pooled representation serves as the final output embedding used for downstream tasks.
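Concretely, mask-aware mean pooling can be written as in the sketch below, assuming a PyTorch last_hidden_state of shape (batch, seq_len, dim) and the tokenizer's attention_mask:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid divide-by-zero
    return summed / counts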

How to Get Started with the Model πŸ‘¨πŸ»β€πŸ’»

Use the code below to get started with the model.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Sailesh97/Hinvec"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area."

inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Mask-aware mean pooling over the token embeddings (excludes padding tokens,
# matching the pooling described above).
mask = inputs['attention_mask'].unsqueeze(-1).type_as(outputs.last_hidden_state)
embedding = ((outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)
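As a usage sketch, embeddings from the snippet above can be compared with cosine similarity for STS or retrieval; the embed helper and the example sentence pair below are illustrative, not part of the model's documented API.

import torch.nn.functional as F

def embed(text):
    # Illustrative helper: tokenize, encode, and mean-pool as shown above.
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs)
    mask = inputs['attention_mask'].unsqueeze(-1).type_as(out.last_hidden_state)
    return ((out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)

e1 = embed("भारत दक्षिण एशिया में स्थित एक देश है।")  # Hindi (Devanagari)
e2 = embed("India is a country in South Asia.")  # English
print(F.cosine_similarity(e1, e2, dim=0).item())  # cross-lingual similarity score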

Model Card Authors

Sailesh97

Model Card Contact

Lingo Research Group at IIT Gandhinagar, India. Email: [email protected]
