---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- causal-lm
- pytorch
- lime
datasets:
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- HuggingFaceTB/everyday-conversations-llama3.1-2k
- Magpie-Align/Magpie-Pro-300K-Filtered
- TIGER-Lab/WebInstruct-verified
- teknium/GPT4-LLM-Cleaned
- yahma/alpaca-cleaned
- Dahoas/synthetic-instruct-gptj-pairwise
pipeline_tag: text-generation
library_name: transformers
---

![logo](logo.png)

**LIME-1B Model Card**

---

> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop an SLM with competitive results.
> LIME-1B was trained for only ~$1,000 yet delivers quality approaching that of models trained on hundreds of thousands of dollars of compute, demonstrating exceptional training efficiency.

---

# LIME-1B

LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context.

It is designed as a **compact, practical base model** for:

- Building RAG systems (context + question → answer)
- Assistant-style Q&A and task completion
- Summarization, explanation, and rewriting tasks in English

> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT.

---

## 1. Model architecture

LIME-1B is a decoder-only Transformer with several quality-oriented design choices:

| Component               | Value                                      |
|-------------------------|--------------------------------------------|
| Architecture            | Decoder-only Transformer                   |
| Parameters              | 1.0B                                       |
| Layers (decoder blocks) | 32                                         |
| d_model                 | 1536                                       |
| FFN dimension (d_ff)    | 6144                                       |
| Attention heads         | 24                                         |
| Vocabulary size         | 50,000                                     |
| Max sequence length     | 512 tokens                                 |
| Positional encoding     | Sinusoidal                                 |
| Norm                    | RMSNorm                                    |
| FFN                     | SiLU MLP                                   |
| Attention               | FlashAttention                             |
| Tying of embeddings     | Output head tied to embedding              |
| Precision (training)    | Mixed fp32/bf16 (autocast) + grad clipping |
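To make the table above more concrete, here is a minimal, illustrative PyTorch sketch of a single decoder block with these dimensions (RMSNorm, SiLU MLP, causal self-attention via `scaled_dot_product_attention`, which can dispatch to FlashAttention kernels). The pre-norm residual layout, module names, and omission of embeddings/positional encoding are assumptions for illustration only, not the released implementation; `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
# Illustrative sketch of one LIME-1B-style decoder block (not the actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 1536, d_ff: int = 6144, n_heads: int = 24):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads  # 1536 / 24 = 64
        self.attn_norm = nn.RMSNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.ffn_norm = nn.RMSNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.SiLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Pre-norm causal self-attention; SDPA can use FlashAttention kernels.
        h = self.attn_norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = [t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v)]
        h = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        h = h.transpose(1, 2).reshape(B, T, C)
        x = x + self.out_proj(h)
        # Pre-norm SiLU MLP with residual connection.
        x = x + self.ffn(self.ffn_norm(x))
        return x


if __name__ == "__main__":
    block = DecoderBlock()
    x = torch.randn(1, 8, 1536)  # (batch, seq_len, d_model)
    print(block(x).shape)        # torch.Size([1, 8, 1536])
```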
## 2. Training data

### 2.1 Pretraining

The base model is pretrained as a standard causal language model on English web data:

- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split)
- **Language filter**: English-only subset
- **Objective**: next-token prediction (causal LM)
- **Token budget**: 20B tokens
- **Context length**: 512 tokens

### 2.2 Instruction fine-tuning (SFT)

After pretraining, the model is fine-tuned on a **unified instruction schema**:

```text
instruction_text
response_text
```

**SFT Data Mixture** (~97k examples total):

- [HuggingFaceTB/everyday-conversations-llama3.1-2k](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k)
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [teknium/GPT4-LLM-Cleaned](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned)
- [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered)
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)

## 3. Training details

### Hardware

- **GPUs**: 8 × NVIDIA A100 80GB (data parallel)
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0)

### Pretraining

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: ~5e-4
- Polynomial decay to 5e-6
- Warmup: ~5% of total steps

### Instruction fine-tuning (SFT)

**Objective**: Cross-entropy loss on next-token prediction

**Optimizer**: AdamW
- β₁ = 0.9
- β₂ = 0.95
- Weight decay applied to non-norm/non-bias parameters

**Learning Rate Schedule**:
- Peak LR: 8e-5
- Polynomial decay to 1e-5
- Warmup: 10% of total steps

## 4. Evaluation benchmarks

The following chart compares LIME-1B against other models across 8 standard evaluation tasks:

[![Metrics Chart](metrics_chart.png)](metrics_chart.png)

## 5. Usage

```python
# Example usage
# pip install -U transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "anarlavrenov/lime-1b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)


def build_prompt(question):
    uid = ""  # user-turn delimiter (model-specific; empty in this example)
    aid = ""  # assistant-turn delimiter (model-specific; empty in this example)
    return uid + question + aid


question = "Write five questions for a Data Scientist interview."
prompt = build_prompt(question)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs["input_ids"].shape[1]

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    early_stopping=True,
    repetition_penalty=1.15,
    no_repeat_ngram_size=3,
    min_new_tokens=16,
    do_sample=False,
    top_p=None,
    temperature=None,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

generated_tokens = outputs[0][input_length:]
output = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(output)

# 1. Can you tell us about your experience with data analysis and modeling?
# 2. How do you approach data cleaning and preprocessing?
# 3. How do you approach data visualization and storytelling?
# 4. Can you walk us through a time when you used data to solve a problem?
# 5. How do you approach the ethical considerations of data science and machine learning?
```
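Since the card lists RAG systems (context + question → answer) as a primary use case, the sketch below shows one way a retrieval-augmented prompt could be assembled with the same `build_prompt` helper, `tokenizer`, and `model` from the snippet above. The context/question layout, the example `context` string, and the generation settings are illustrative assumptions, not a documented prompt format.

```python
# Hypothetical RAG-style prompt: prepend retrieved context to the question.
# Continues from the usage snippet above (reuses tokenizer, model, build_prompt).
context = (
    "LIME-1B is a 1B-parameter decoder-only Transformer pretrained on 20B tokens "
    "of FineWeb-Edu and instruction-tuned on ~97k examples."
)
question = "How many tokens was LIME-1B pretrained on?"

rag_prompt = build_prompt(f"Context:\n{context}\n\nQuestion: {question}")

inputs = tokenizer(rag_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    num_beams=4,
    no_repeat_ngram_size=3,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```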
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation.

**Anar Lavrenov**

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/anar-lavrenov/)

Feel free to reach out with questions or feedback about LIME-1B!

## Citation

```bibtex
@misc{lime1b2025,
  title        = {LIME-1B: A 1B-parameter English Causal Language Model},
  author       = {Anar Lavrenov},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}}
}
```