---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- causal-lm
- pytorch
- lime
datasets:
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- HuggingFaceTB/everyday-conversations-llama3.1-2k
- Magpie-Align/Magpie-Pro-300K-Filtered
- TIGER-Lab/WebInstruct-verified
- teknium/GPT4-LLM-Cleaned
- yahma/alpaca-cleaned
- Dahoas/synthetic-instruct-gptj-pairwise
pipeline_tag: text-generation
library_name: transformers
---

**LIME-1B Model Card**
---
> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop an SLM that demonstrates competitive results.
> LIME-1B was trained for only ~$1,000 yet delivers quality approaching that of models trained on hundreds of thousands of dollars of compute, demonstrating exceptional training efficiency.
---
# LIME-1B
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for:
- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English
> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.
---
## 1. Model architecture
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
| Component | Value |
|-------------------------|--------------------------------------------|
| Architecture | Decoder-only Transformer |
| Parameters | 1.0B |
| Layers (decoder blocks) | 32 |
| d_model | 1536 |
| FFN dimension (d_ff) | 6144 |
| Attention heads | 24 |
| Vocabulary size | 50,000 |
| Max sequence length | 512 tokens |
| Positional encoding | Sinusoidal |
| Norm | RMSNorm |
| FFN | SiLU MLP |
| Attention | FlashAttention |
| Tying of embeddings | Output head tied to embedding |
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping |
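
For orientation, the table above can be collected into a configuration sketch like the one below. The field names are illustrative only and are not the model's actual config class; the real configuration ships with the checkpoint.

```python
# Illustrative config sketch gathering the hyperparameters from the table above.
# Field names are hypothetical; the actual configuration is loaded with the checkpoint.
from dataclasses import dataclass

@dataclass
class Lime1BConfigSketch:
    n_layers: int = 32            # decoder blocks
    d_model: int = 1536           # hidden size
    d_ff: int = 6144              # SiLU MLP (feed-forward) dimension
    n_heads: int = 24             # attention heads -> head_dim = 1536 // 24 = 64
    vocab_size: int = 50_000
    max_seq_len: int = 512        # sinusoidal positional encoding
    tie_embeddings: bool = True   # output head shares weights with the token embedding

print(Lime1BConfigSketch())
```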
## 2. Training data
### 2.1 Pretraining
The base model is pretrained as a standard causal language model on English web data:
- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split)
- **Language filter**: English-only subset
- **Objective**: next-token prediction (causal LM)
- **Token budget**: 20B tokens
- **Context length**: 512 tokens
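
A minimal sketch of this objective, assuming simple document concatenation into fixed 512-token blocks (the exact packing pipeline is not part of this card):

```python
# Sketch of the causal-LM objective at 512-token context: pack token ids into
# fixed-length blocks and compute next-token cross-entropy on shifted labels.
import torch

BLOCK_SIZE = 512

def pack_into_blocks(token_ids, block_size=BLOCK_SIZE):
    """Split a long stream of token ids into full, fixed-length blocks."""
    n_blocks = len(token_ids) // block_size
    return torch.tensor(token_ids[: n_blocks * block_size]).view(n_blocks, block_size)

def next_token_loss(logits, input_ids):
    """Predict token t+1 from tokens <= t (standard causal LM loss)."""
    shifted_logits = logits[:, :-1, :]     # drop the last position
    targets = input_ids[:, 1:]             # shift labels left by one
    return torch.nn.functional.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)), targets.reshape(-1)
    )
```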
### 2.2 Instruction fine-tuning (SFT)
After pretraining, the model is fine-tuned on a **unified instruction schema**:
```text
<user> instruction_text <assistant> response_text <eos>
```
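
A minimal serialization sketch for one SFT record; the marker concatenation mirrors the `build_prompt` helper in the Usage section, and the `<eos>` string stands in for the tokenizer's EOS token:

```python
# Hypothetical helper: map one instruction/response pair onto the schema above.
def to_sft_example(instruction: str, response: str, eos: str = "<eos>") -> str:
    return "<user>" + instruction + "<assistant>" + response + eos

print(to_sft_example("Name three primary colors.", "Red, blue, and yellow."))
# <user>Name three primary colors.<assistant>Red, blue, and yellow.<eos>
```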
**SFT Data Mixture** (~97k examples total):
- [HuggingFaceTB/everyday-conversations-llama3.1-2k](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [teknium/GPT4-LLM-Cleaned](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned)
- [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered)
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
## 3. Training Details
### Hardware
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel)
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0)
### Pretraining
**Objective**: Cross-entropy loss on next-token prediction
**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
**Learning Rate Schedule**:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps
### Instruction fine-tuning (SFT)
**Objective**: Cross-entropy loss on next-token prediction
**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters
**Learning Rate Schedule**:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps
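
A sketch of the optimizer and schedule described above, shown with the SFT numbers. The total step count, decay power, and weight-decay value are assumptions, and the parameter grouping is one common way to exclude norm/bias weights from decay:

```python
# Sketch only: AdamW with polynomial decay + warmup, norm/bias excluded from weight decay.
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

def build_optimizer_and_schedule(model, total_steps, peak_lr=8e-5, end_lr=1e-5,
                                 warmup_frac=0.10, weight_decay=0.1):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and (RMS)norm scales go into the no-decay group.
        (no_decay if p.ndim < 2 or "norm" in name.lower() else decay).append(p)

    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=peak_lr, betas=(0.9, 0.95),
    )
    scheduler = get_polynomial_decay_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
        lr_end=end_lr,
    )
    return optimizer, scheduler
```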
## 4. Evaluation Benchmarks
Charts comparing LIME-1B against other models across 8 standard evaluation tasks:

![Benchmark comparison across 8 evaluation tasks](metrics_chart.png)
## Usage
```python
# Example usage
# pip install -U transformers accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/lime-1b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

def build_prompt(question):
    # Wrap the question in the instruction schema: <user> ... <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return uid + question + aid

question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs["input_ids"].shape[1]

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output)
# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
```
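
Because the SFT mixture includes examples with retrieval context, a RAG-style prompt can reuse the same markers. The exact context format used during SFT is not documented here, so the `Context:`/`Question:` layout below is an assumption; the snippet reuses `tokenizer` and `model` from the example above.

```python
# Assumed RAG-style layout: context and question both placed inside the <user> turn.
def build_rag_prompt(context: str, question: str) -> str:
    return "<user>" + "Context: " + context + "\n\nQuestion: " + question + "<assistant>"

rag_prompt = build_rag_prompt(
    "LIME-1B is a 1B-parameter decoder-only Transformer pretrained on 20B tokens "
    "of FineWeb-Edu and instruction-tuned on ~97k examples.",
    "How many tokens was LIME-1B pretrained on?",
)
inputs = tokenizer(rag_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4,
                         do_sample=False, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```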
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation.
**Anar Lavrenov**
[LinkedIn](https://www.linkedin.com/in/anar-lavrenov/)
Feel free to reach out with questions or feedback about LIME-1B!
## Citation
```bibtex
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}
```