---
language: hi
license: mit
tags:
- hindi
- embeddings
- sentence-embeddings
- semantic-search
- text-similarity
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: transformers
---
# Hindi Sentence Embeddings Model
This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer architecture with specialized pooling strategies to produce high-quality semantic representations of Hindi sentences.
## Features
- Specialized for Hindi language text
- Advanced transformer architecture with optimized attention mechanism
- Multiple pooling strategies for enhanced semantic representations
- Creates normalized vector representations for semantic similarity
- Supports semantic search and text similarity applications
## Usage
### Installation
```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model
```
### Enhanced RAG System
This model now includes an enhanced RAG (Retrieval Augmented Generation) system that integrates Unsloth's optimized Llama-3.2-1B-Instruct model for question answering on top of Hindi document retrieval.
#### Setup and Installation
1. Install additional dependencies:
```bash
pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
```
2. Index your documents:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
```
3. Run in QA mode with LLM:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
```
### Basic Embedding Usage
```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।",
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between two sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search over a small document set
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।",
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।",
]
model.visualize_embeddings(example_sentences)
```
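Because the model's vectors are L2-normalized, semantic search of the kind `model.search` performs reduces to a dot product between the query vector and a matrix of document vectors, followed by ranking. A minimal NumPy sketch of that idea, using random 768-dimensional vectors as stand-ins for real embeddings:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    # Scale each row to unit length so cosine similarity equals a dot product
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(42)
doc_vecs = l2_normalize(rng.standard_normal((5, 768)))   # stand-in document embeddings
query_vec = l2_normalize(rng.standard_normal((1, 768)))  # stand-in query embedding

scores = (doc_vecs @ query_vec.T).ravel()  # cosine similarities
ranking = np.argsort(-scores)              # indices of best matches first
print(ranking, scores[ranking[0]])
```

With real embeddings from `model.encode`, the same ranking step scales to large corpora via an inner-product index such as FAISS (already listed in the RAG dependencies above).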
## Model Details
This model uses an advanced transformer-based architecture with the following enhancements:
- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity
Technical specifications:
- Embedding dimension: 768
- Hidden dimension: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens
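Since the context length is 128 tokens, longer documents should be split into overlapping chunks before encoding. A rough sketch, using whitespace-separated words as a proxy for tokens (the real tokenizer is SentencePiece, so actual token counts will differ; the `chunk_words` helper and its parameters are illustrative, not part of this package):

```python
def chunk_words(text: str, max_words: int = 100, overlap: int = 20):
    # Split on whitespace as a rough proxy for subword tokens;
    # keep max_words below the 128-token limit to leave headroom.
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

long_text = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(long_text)
print(len(chunks))  # each chunk overlaps the previous by 20 words
```

Each chunk can then be passed to `model.encode` separately, and the resulting vectors indexed individually.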
## Applications
- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Question answering
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi language content
## License
This model is released under the MIT License.
## Citation
If you use this model in your research or application, please cite us:
```
@misc{DeepMostInnovations2025hindi,
  author       = {DeepMost Innovations},
  title        = {Hindi Sentence Embeddings Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
}
```