---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- causal-lm
- pytorch
- lime
datasets:
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- HuggingFaceTB/everyday-conversations-llama3.1-2k
- Magpie-Align/Magpie-Pro-300K-Filtered
- TIGER-Lab/WebInstruct-verified
- teknium/GPT4-LLM-Cleaned
- yahma/alpaca-cleaned
- Dahoas/synthetic-instruct-gptj-pairwise
pipeline_tag: text-generation
library_name: transformers
---


![logo](logo.png)
**LIME-1B Model Card**

---

> **Note**: This model is proof that a single individual, without any team or institutional backing, can develop an SLM with competitive results.
> LIME-1B was trained for only ~$1,000 yet delivers quality approaching that of models trained on hundreds of thousands of dollars of compute, demonstrating exceptional training efficiency.

---

# LIME-1B

LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for:

- Building RAG systems (context + question → answer)  
- Assistant-style Q&A and task completion  
- Summarization, explanation, and rewriting tasks in English  

> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.

---

## 1. Model architecture

LIME-1B is a decoder-only Transformer with several quality-oriented design choices:

| Component               | Value                                      |
|-------------------------|--------------------------------------------|
| Architecture            | Decoder-only Transformer                   |
| Parameters              | 1.0B                                       |
| Layers (decoder blocks) | 32                                         |
| d_model                 | 1536                                       |
| FFN dimension (d_ff)    | 6144                                       |
| Attention heads         | 24                                         |
| Vocabulary size         | 50,000                                     |
| Max sequence length     | 512 tokens                                 |
| Positional encoding     | Sinusoidal                                 |
| Norm                    | RMSNorm                                    |
| FFN                     | SiLU MLP                                   |
| Attention               | FlashAttention                             |
| Tying of embeddings     | Output head tied to embedding              |
| Precision (training)    | Mixed fp32/bf16 (autocast) + grad clipping |

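For quick reference, the table corresponds roughly to the configuration below (a minimal sketch; the field names are illustrative and are not the model's actual config keys):

```python
# Illustrative hyperparameter summary mirroring the table above.
# Field names are hypothetical; they do not reflect the real training code.
from dataclasses import dataclass

@dataclass
class Lime1BConfig:
    n_layers: int = 32            # decoder blocks
    d_model: int = 1536           # hidden size
    d_ff: int = 6144              # SiLU MLP width
    n_heads: int = 24             # attention heads (head_dim = 1536 / 24 = 64)
    vocab_size: int = 50_000
    max_seq_len: int = 512        # sinusoidal positional encoding
    tie_embeddings: bool = True   # output head tied to the token embedding
```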

## 2. Training data

### 2.1 Pretraining

The base model is pretrained as a standard causal language model on English web data:

- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split) 
- **Language filter**: English-only subset  
- **Objective**: next-token prediction (causal LM)  
- **Token budget**: 20B tokens  
- **Context length**: 512 tokens  

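The objective is plain next-token prediction; a minimal sketch of the loss computation (not the actual training loop) looks like this:

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for next-token prediction.

    logits:    (batch, seq_len, vocab_size) from the decoder
    input_ids: (batch, seq_len) token ids that produced the logits
    """
    # Position t predicts token t+1: drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```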

### 2.2 Instruction fine-tuning (SFT)

After pretraining, the model is fine-tuned on a **unified instruction schema**:

```text
<user> instruction_text <assistant> response_text <eos>
```
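
As an illustration, a single (instruction, response) pair maps onto this schema roughly as follows (the helper is hypothetical; exact spacing and loss masking may differ in the actual SFT pipeline):

```python
# Hypothetical helper showing how one SFT example is rendered into the unified schema.
def render_sft_example(instruction: str, response: str) -> str:
    return f"<user> {instruction} <assistant> {response} <eos>"

print(render_sft_example(
    "List three uses of a compact language model.",
    "RAG answering, summarization, and rewriting.",
))
# <user> List three uses of a compact language model. <assistant> RAG answering, summarization, and rewriting. <eos>
```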

**SFT Data Mixture** (~97k examples total):

- [HuggingFaceTB/everyday-conversations-llama3.1-2k](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [teknium/GPT4-LLM-Cleaned](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned)
- [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered)
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)

## 3. Training details

### Hardware
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel)
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0)

### Pretraining

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps
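
A schedule of this shape can be sketched with PyTorch's `LambdaLR` (a sketch under stated assumptions: the total step count, decay power, and weight-decay value below are placeholders, not the values actually used):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

peak_lr, final_lr = 5e-4, 5e-6
total_steps = 100_000                      # placeholder; real step count not published
warmup_steps = int(0.05 * total_steps)     # ~5% warmup
power = 1.0                                # placeholder decay power

def lr_lambda(step: int) -> float:
    # Returns a multiplier applied to peak_lr.
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear warmup
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    decayed = final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power
    return decayed / peak_lr                               # polynomial decay to final_lr

model = torch.nn.Linear(8, 8)              # stand-in module
# The card applies weight decay only to non-norm/non-bias parameters,
# which would use parameter groups; a single group is shown for brevity.
optimizer = AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1)
scheduler = LambdaLR(optimizer, lr_lambda)
```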

### Instruction fine-tuning (SFT)

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps

## 4. Evaluation benchmarks

The chart below compares LIME-1B against other models across 8 standard evaluation tasks: [![Metrics chart](metrics_chart.png)](metrics_chart.png)

## Usage
```python
# Example usage
# pip install -U transformers torch

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/lime-1b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

def build_prompt(question: str) -> str:
    # Wrap the question in the special tokens used by the SFT schema.
    return "<user>" + question + "<assistant>"

question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs['input_ids'].shape[1]

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(output)

# 1. Can you tell us about your experience with data analysis and modeling? 
# 2. How do you approach data cleaning and preprocessing? 
# 3. How do you approach data visualization and storytelling? 
# 4. Can you walk us through a time when you used data to solve a problem? 
# 5. How do you approach the ethical considerations of data science and machine learning?

```
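
If sampling is preferred over beam search, a decoding configuration along these lines should also work (the sampling values are illustrative, not tuned for LIME-1B):

```python
# Sampling-based decoding; temperature/top_p values are illustrative only.
sampled = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.15,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(sampled[0][input_length:], skip_special_tokens=True))
```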

If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation.

**Anar Lavrenov**

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/anar-lavrenov/)

Feel free to reach out with questions or feedback about LIME-1B!

## Citation
```bibtex
@misc{lime1b2025,
  title         = {LIME-1B: A 1B-parameter English Causal Language Model},
  author        = {Anar Lavrenov},
  year          = {2025},
  howpublished  = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}
```