|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-generation |
|
|
- transformer |
|
|
- causal-lm |
|
|
- pytorch |
|
|
- lime |
|
|
datasets: |
|
|
- HuggingFaceH4/no_robots |
|
|
- databricks/databricks-dolly-15k |
|
|
- HuggingFaceTB/everyday-conversations-llama3.1-2k |
|
|
- Magpie-Align/Magpie-Pro-300K-Filtered |
|
|
- TIGER-Lab/WebInstruct-verified |
|
|
- teknium/GPT4-LLM-Cleaned |
|
|
- yahma/alpaca-cleaned |
|
|
- Dahoas/synthetic-instruct-gptj-pairwise |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
|
|
|
 |
|
|
**LIME-1B Model Card** |
|
|
|
|
|
--- |
|
|
|
|
|
> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop a small language model (SLM) that achieves competitive results.
|
|
> LIME-1B was trained for only ~$1,000 yet delivers quality approaching that of models trained on hundreds of thousands of dollars of compute, demonstrating exceptional training efficiency.
|
|
|
|
|
--- |
|
|
|
|
|
# LIME-1B |
|
|
|
|
|
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for: |
|
|
|
|
|
- Building RAG systems (context + question → answer) |
|
|
- Assistant-style Q&A and task completion |
|
|
- Summarization, explanation, and rewriting tasks in English |
|
|
|
|
|
> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. Model architecture |
|
|
|
|
|
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
|
|
|
|
|
| Component | Value | |
|
|
|-------------------------|--------------------------------------------| |
|
|
| Architecture | Decoder-only Transformer | |
|
|
| Parameters | 1.0B | |
|
|
| Layers (decoder blocks) | 32 | |
|
|
| d_model | 1536 | |
|
|
| FFN dimension (d_ff) | 6144 | |
|
|
| Attention heads | 24 | |
|
|
| Vocabulary size | 50,000 | |
|
|
| Max sequence length | 512 tokens | |
|
|
| Positional encoding | Sinusoidal | |
|
|
| Norm | RMSNorm | |
|
|
| FFN | SiLU MLP | |
|
|
| Attention | FlashAttention | |
|
|
| Weight tying            | Output head tied to input embedding        |
|
|
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping | |
|
|
|
|
|
|
|
|
## 2. Training data |
|
|
|
|
|
### 2.1 Pretraining |
|
|
|
|
|
The base model is pretrained as a standard causal language model on English web data: |
|
|
|
|
|
- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split) |
|
|
- **Language filter**: English-only subset |
|
|
- **Objective**: next-token prediction (causal LM) |
|
|
- **Token budget**: 20B tokens |
|
|
- **Context length**: 512 tokens |
|
|
|
|
|
|
|
|
### 2.2 Instruction fine-tuning (SFT) |
|
|
|
|
|
After pretraining, the model is fine-tuned on a **unified instruction schema**: |
|
|
|
|
|
```text |
|
|
<user> instruction_text <assistant> response_text <eos> |
|
|
``` |
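
For illustration, a single instruction/response pair would be serialized roughly as follows before tokenization; the special markers are the ones defined above, while the helper name, example strings, and exact whitespace handling are assumptions:

```python
def serialize_sft_example(instruction: str, response: str) -> str:
    """Render one SFT example in the unified instruction schema."""
    return f"<user> {instruction} <assistant> {response} <eos>"

print(serialize_sft_example(
    "Summarize the paragraph in one sentence.",
    "The paragraph argues that compact models can be trained cheaply.",
))
# <user> Summarize the paragraph in one sentence. <assistant> The paragraph argues ... <eos>
```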
|
|
|
|
|
**SFT Data Mixture** (~97k examples total, sampled from the following datasets):
|
|
|
|
|
- [HuggingFaceTB/everyday-conversations-llama3.1-2k](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k) |
|
|
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) |
|
|
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) |
|
|
- [teknium/GPT4-LLM-Cleaned](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned) |
|
|
- [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) |
|
|
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
- [TIGER-Lab/WebInstruct-verified](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified)
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
|
|
|
|
|
## 3. Training details
|
|
|
|
|
### 3.1 Hardware
|
|
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel) |
|
|
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0) |
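
As a toy illustration of this setup (not the actual training script), one training step with bf16 autocast and gradient clipping at max_norm = 1.0 looks like the sketch below; the model, data, and optimizer are placeholders, and the DDP wiring is omitted:

```python
import torch
import torch.nn as nn

# Toy stand-in for one training step: bf16 autocast + gradient clipping at
# max_norm = 1.0, as described above. DDP setup and the real model are omitted.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

x = torch.randn(4, 16, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```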
|
|
|
|
|
### 3.2 Pretraining
|
|
|
|
|
**Objective**: Cross-entropy loss on next-token prediction |
|
|
|
|
|
**Optimizer**: AdamW |
|
|
- β₁ = 0.9 |
|
|
- β₂ = 0.95 |
|
|
- Weight decay applied to non-norm/non-bias parameters |
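
A common way to implement the last point is to split parameters into two optimizer groups. The sketch below is one possible implementation, not necessarily the one used here; the name/shape heuristic and the weight-decay value are assumptions:

```python
import torch
import torch.nn as nn

def build_adamw(model: nn.Module, lr: float, weight_decay: float) -> torch.optim.AdamW:
    """AdamW with weight decay applied only to non-norm/non-bias parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Heuristic (assumed): 1-D tensors cover RMSNorm weights and biases.
        (no_decay if param.ndim < 2 or name.endswith("bias") else decay).append(param)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
        betas=(0.9, 0.95),  # β1 = 0.9, β2 = 0.95 as stated above
    )

# optimizer = build_adamw(model, lr=5e-4, weight_decay=0.1)  # decay value not stated above
```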
|
|
|
|
|
**Learning Rate Schedule**: |
|
|
- Peak LR: ~5e-4 |
|
|
- Polynomial decay to 5e-6 |
|
|
- Warmup: ~5% of total steps |
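
Written as a function of the training step, this schedule looks roughly like the sketch below; linear warmup and a polynomial power of 1.0 are assumptions, since neither is specified above:

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-4, final_lr: float = 5e-6,
               warmup_frac: float = 0.05, power: float = 1.0) -> float:
    """Warmup + polynomial decay, as described for pretraining."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power

# Start of training, end of warmup, and final step:
print(lr_at_step(0, 100_000), lr_at_step(5_000, 100_000), lr_at_step(100_000, 100_000))
# 0.0 0.0005 5e-06
```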
|
|
|
|
|
### 3.3 Instruction fine-tuning (SFT)
|
|
|
|
|
**Objective**: Cross-entropy loss on next-token prediction |
|
|
|
|
|
**Optimizer**: AdamW |
|
|
- β₁ = 0.9 |
|
|
- β₂ = 0.95 |
|
|
- Weight decay applied to non-norm/non-bias parameters |
|
|
|
|
|
**Learning Rate Schedule**: |
|
|
- Peak LR: 8e-5 |
|
|
- Polynomial decay to 1e-5 |
|
|
- Warmup: 10% of total steps |
|
|
|
|
|
## 4. Evaluation benchmarks
|
|
|
|
|
The chart below compares LIME-1B against other models across eight standard evaluation tasks:

![Benchmark comparison of LIME-1B across eight standard evaluation tasks](metrics_chart.png)
|
|
|
|
|
## 5. Usage
|
|
```python |
|
|
# Example usage |
|
|
# pip install -U transformers torch
|
|
|
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
model_name = "anarlavrenov/lime-1b-instruct" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
def build_prompt(question):
    # Wrap the question in the SFT prompt schema: <user> question <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return uid + question + aid
|
|
|
|
|
question = "Write five questions for a Data Scientist interview." |
|
|
prompt = build_prompt(question) |
|
|
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
input_length = inputs['input_ids'].shape[1] |
|
|
|
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=128, |
|
|
num_beams=4, |
|
|
early_stopping=True, |
|
|
repetition_penalty=1.15, |
|
|
no_repeat_ngram_size=3, |
|
|
min_new_tokens=16, |
|
|
do_sample=False, |
|
|
top_p=None, |
|
|
temperature=None, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id, |
|
|
) |
|
|
|
|
|
generated_tokens = outputs[0][input_length:] |
|
|
output = tokenizer.decode(generated_tokens, skip_special_tokens=True) |
|
|
|
|
|
print(output) |
|
|
|
|
|
# 1. Can you tell us about your experience with data analysis and modeling? |
|
|
# 2. How do you approach data cleaning and preprocessing? |
|
|
# 3. How do you approach data visualization and storytelling? |
|
|
# 4. Can you walk us through a time when you used data to solve a problem? |
|
|
# 5. How do you approach the ethical considerations of data science and machine learning? |
|
|
|
|
|
``` |
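
For RAG-style use (context + question → answer, as listed under intended uses), the retrieved context can be placed inside the `<user>` turn. The exact context formatting used during fine-tuning is not documented here, so treat the helper below as a hypothetical template rather than a prescribed format:

```python
def build_rag_prompt(context: str, question: str) -> str:
    # Hypothetical layout: retrieved context and question share the <user> turn.
    return "<user>" + "Context: " + context + "\n\nQuestion: " + question + "<assistant>"

rag_prompt = build_rag_prompt(
    "LIME-1B is a 1B-parameter decoder-only Transformer trained on English web data.",
    "How many parameters does LIME-1B have?",
)
# Tokenize `rag_prompt` and call model.generate exactly as in the example above.
```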
|
|
|
|
|
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation. |
|
|
|
|
|
**Anar Lavrenov** |
|
|
|
|
|
[LinkedIn](https://www.linkedin.com/in/anar-lavrenov/)
|
|
|
|
|
Feel free to reach out with questions or feedback about LIME-1B!
|
|
|
|
|
## 6. Citation
|
|
```bibtex |
|
|
@misc{lime1b2025, |
|
|
title = {LIME-1B: A 1B-parameter English Causal Language Model}, |
|
|
author = {Anar Lavrenov}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}} |
|
|
} |
|
|
``` |