---
language: hi
license: mit
tags:
  - hindi
  - embeddings
  - sentence-embeddings
  - semantic-search
  - text-similarity
datasets:
  - custom
pipeline_tag: sentence-similarity
library_name: transformers
---

# Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer architecture with specialized pooling strategies to produce high-quality semantic representations of Hindi sentences.

## Features

- Specialized for Hindi-language text
- Transformer architecture with an optimized attention mechanism
- Multiple pooling strategies for enhanced semantic representations
- Produces L2-normalized vectors, so cosine similarity reduces to a dot product
- Supports semantic search and text similarity applications
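
Because the embeddings are L2-normalized, comparing two sentences is just a dot product. A minimal numpy sketch of that property (toy 3-dimensional vectors standing in for real sentence embeddings; this is not the model's API):

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit length so its dot products are cosine similarities
    return v / np.linalg.norm(v)

a = l2_normalize(np.array([1.0, 2.0, 3.0]))
b = l2_normalize(np.array([-2.0, 1.0, 0.0]))
print(float(a @ a))  # self-similarity of a unit vector: 1 (up to float error)
print(float(a @ b))  # these two vectors are orthogonal, so cosine is 0
```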

## Usage

### Installation

```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install 
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model
```

### Enhanced RAG System

This model now includes an enhanced RAG (Retrieval-Augmented Generation) system that integrates Unsloth's optimized Llama-3.2-1B-Instruct model for question answering on top of Hindi document retrieval.

#### Setup and Installation

1. Install additional dependencies:
```bash
pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
```

2. Index your documents:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
```

3. Run in QA mode with LLM:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
```
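
Conceptually, the indexing step embeds each document, and the QA step retrieves the nearest documents for a query before handing them to the LLM. The retrieval core can be sketched with plain numpy (the actual script uses a FAISS index; the vectors and function names below are illustrative toys, not the script's API):

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=2):
    # With unit-length embeddings, inner product equals cosine similarity,
    # which is what an inner-product FAISS index computes at scale.
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy unit vectors standing in for real 768-dim sentence embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.8, 0.6])
for idx, score in top_k(query, docs):
    print(idx, round(score, 2))
```

The retrieved documents would then be formatted into the LLM prompt as context.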

### Basic Embedding Usage

```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
```

## Model Details

This model uses an advanced transformer-based architecture with the following enhancements:

- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity
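
To make the last two points concrete, here is a hedged numpy sketch of masked mean pooling followed by L2 normalization (shapes and values are illustrative; the model combines this with its other pooling strategies):

```python
import numpy as np

def masked_mean_pool(token_states, attention_mask):
    # token_states: (seq_len, dim) hidden states; attention_mask: (seq_len,) of 0/1
    mask = attention_mask[:, None].astype(float)
    pooled = (token_states * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)
    # L2-normalize so downstream cosine similarity is a plain dot product
    return pooled / np.linalg.norm(pooled)

states = np.array([[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])  # last position is padding and is excluded
vec = masked_mean_pool(states, mask)
print(vec)  # unit-length sentence vector
```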

Technical specifications:
- Embedding dimension: 768
- Hidden dimension: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens

## Applications

- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Question answering
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi language content

## License

This model is released under the MIT License.

## Citation

If you use this model in your research or application, please cite us:

```bibtex
@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model}}
}
```