|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text-generation |
|
|
- transformer |
|
|
- causal-lm |
|
|
- pytorch |
|
|
- lime |
|
|
datasets: |
|
|
- HuggingFaceH4/no_robots |
|
|
- databricks/databricks-dolly-15k |
|
|
- HuggingFaceTB/everyday-conversations-llama3.1-2k |
|
|
- Magpie-Align/Magpie-Pro-300K-Filtered |
|
|
- TIGER-Lab/WebInstruct-verified |
|
|
- teknium/GPT4-LLM-Cleaned |
|
|
- yahma/alpaca-cleaned |
|
|
- Dahoas/synthetic-instruct-gptj-pairwise |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
|
|
|
 |
|
|
**LIME-1B Model Card** |
|
|
|
|
|
--- |
|
|
|
|
|
> **Note**: This model serves as proof that a single individual, without any team or institutional backing, can develop a small language model (SLM) that achieves competitive results.
|
|
> LIME-1B was trained for only ~$1,000 yet delivers quality approaching that of models trained on hundreds of thousands of dollars of compute, demonstrating exceptional training efficiency.
|
|
|
|
|
--- |
|
|
|
|
|
# LIME-1B |
|
|
|
|
|
LIME-1B is a 1B-parameter, decoder-only Transformer language model trained from scratch on English web data and then instruction-tuned on a curated mixture of assistant-style datasets with and without retrieval context. It is designed as a **compact, practical base model** for: |
|
|
|
|
|
- Building RAG systems (context + question → answer) |
|
|
- Assistant-style Q&A and task completion |
|
|
- Summarization, explanation, and rewriting tasks in English |
|
|
|
|
|
> ⚠️ LIME-1B is **not** RLHF/DPO-aligned and does **not** have tool use or multi-turn chat training baked in. It is an instruction-tuned LM, not a fully aligned assistant like ChatGPT. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1. Model architecture |
|
|
|
|
|
LIME-1B is a decoder-only Transformer with several quality-oriented design choices:
|
|
|
|
|
| Component | Value | |
|
|
|-------------------------|--------------------------------------------| |
|
|
| Architecture | Decoder-only Transformer | |
|
|
| Parameters | 1.0B | |
|
|
| Layers (decoder blocks) | 32 | |
|
|
| d_model | 1536 | |
|
|
| FFN dimension (d_ff) | 6144 | |
|
|
| Attention heads | 24 | |
|
|
| Vocabulary size | 50,000 | |
|
|
| Max sequence length | 512 tokens | |
|
|
| Positional encoding | Sinusoidal | |
|
|
| Norm | RMSNorm | |
|
|
| FFN | SiLU MLP | |
|
|
| Attention | FlashAttention | |
|
|
| Weight tying            | Output head tied to input embedding        |
|
|
| Precision (training) | Mixed fp32/bf16 (autocast) + grad clipping | |
|
|
|
|
|
|
|
|
## 2. Training data |
|
|
|
|
|
### 2.1 Pretraining |
|
|
|
|
|
The base model is pretrained as a standard causal language model on English web data: |
|
|
|
|
|
- **Corpus**: FineWeb-Edu (CC-MAIN-2025-05 split) |
|
|
- **Language filter**: English-only subset |
|
|
- **Objective**: next-token prediction (causal LM) |
|
|
- **Token budget**: 20B tokens |
|
|
- **Context length**: 512 tokens |
|
|
|
|
|
|
|
|
### 2.2 Instruction fine-tuning (SFT) |
|
|
|
|
|
After pretraining, the model is fine-tuned on a **unified instruction schema**: |
|
|
|
|
|
```text |
|
|
<user> instruction_text <assistant> response_text <eos> |
|
|
``` |
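
For illustration, a single instruction/response pair would be serialized roughly as follows before tokenization; the special markers are the ones defined above, while the helper name, example strings, and exact whitespace handling are assumptions:

```python
def serialize_sft_example(instruction: str, response: str) -> str:
    """Render one SFT example in the unified instruction schema."""
    return f"<user> {instruction} <assistant> {response} <eos>"

print(serialize_sft_example(
    "Summarize the paragraph in one sentence.",
    "The paragraph argues that compact models can be trained cheaply.",
))
# <user> Summarize the paragraph in one sentence. <assistant> The paragraph argues ... <eos>
```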
|
|
|
|
|
**SFT Data Mixture** (~97k examples total, sampled from the following datasets):
|
|
|
|
|
- [HuggingFaceTB/everyday-conversations-llama3.1-2k](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k) |
|
|
- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) |
|
|
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) |
|
|
- [teknium/GPT4-LLM-Cleaned](https://huggingface.co/datasets/teknium/GPT4-LLM-Cleaned) |
|
|
- [Magpie-Align/Magpie-Pro-300K-Filtered](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered) |
|
|
- [Dahoas/synthetic-instruct-gptj-pairwise](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise)
- [TIGER-Lab/WebInstruct-verified](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified)
- [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
|
|
|
|
|
## 3. Training details
|
|
|
|
|
### 3.1 Hardware
|
|
- **GPUs**: 8 × NVIDIA A100 80GB (data parallel) |
|
|
- **Precision**: bfloat16 with gradient clipping (max_norm = 1.0) |
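
As a toy illustration of this setup (not the actual training script), one training step with bf16 autocast and gradient clipping at max_norm = 1.0 looks like the sketch below; the model, data, and optimizer are placeholders, and the DDP wiring is omitted:

```python
import torch
import torch.nn as nn

# Toy stand-in for one training step: bf16 autocast + gradient clipping at
# max_norm = 1.0, as described above. DDP setup and the real model are omitted.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

x = torch.randn(4, 16, device=device)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```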
|
|
|
|
|
### 3.2 Pretraining
|
|
|
|
|
**Objective**: Cross-entropy loss on next-token prediction |
|
|
|
|
|
**Optimizer**: AdamW |
|
|
- β₁ = 0.9 |
|
|
- β₂ = 0.95 |
|
|
- Weight decay applied to non-norm/non-bias parameters |
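
A common way to implement the last point is to split parameters into two optimizer groups. The sketch below is one possible implementation, not necessarily the one used here; the name/shape heuristic and the weight-decay value are assumptions:

```python
import torch
import torch.nn as nn

def build_adamw(model: nn.Module, lr: float, weight_decay: float) -> torch.optim.AdamW:
    """AdamW with weight decay applied only to non-norm/non-bias parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Heuristic (assumed): 1-D tensors cover RMSNorm weights and biases.
        (no_decay if param.ndim < 2 or name.endswith("bias") else decay).append(param)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
        betas=(0.9, 0.95),  # β1 = 0.9, β2 = 0.95 as stated above
    )

# optimizer = build_adamw(model, lr=5e-4, weight_decay=0.1)  # decay value not stated above
```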
|
|
|
|
|
**Learning Rate Schedule**: |
|
|
- Peak LR: ~5e-4 |
|
|
- Polynomial decay to 5e-6 |
|
|
- Warmup: ~5% of total steps |
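
Written as a function of the training step, this schedule looks roughly like the sketch below; linear warmup and a polynomial power of 1.0 are assumptions, since neither is specified above:

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-4, final_lr: float = 5e-6,
               warmup_frac: float = 0.05, power: float = 1.0) -> float:
    """Warmup + polynomial decay, as described for pretraining."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup (assumed)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + (peak_lr - final_lr) * (1.0 - progress) ** power

# Start of training, end of warmup, and final step:
print(lr_at_step(0, 100_000), lr_at_step(5_000, 100_000), lr_at_step(100_000, 100_000))
# 0.0 0.0005 5e-06
```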
|
|
|
|
|
### 3.3 Instruction fine-tuning (SFT)
|
|
|
|
|
**Objective**: Cross-entropy loss on next-token prediction |
|
|
|
|
|
**Optimizer**: AdamW |
|
|
- β₁ = 0.9 |
|
|
- β₂ = 0.95 |
|
|
- Weight decay applied to non-norm/non-bias parameters |
|
|
|
|
|
**Learning Rate Schedule**: |
|
|
- Peak LR: 8e-5 |
|
|
- Polynomial decay to 1e-5 |
|
|
- Warmup: 10% of total steps |
|
|
|
|
|
## 4. Evaluation benchmarks
|
|
|
|
|
The chart below compares LIME-1B against other models across eight standard evaluation tasks:

![Benchmark comparison of LIME-1B across eight standard evaluation tasks](metrics_chart.png)
|
|
|
|
|
## 5. Usage
|
|
```python |
|
|
# Example usage |
|
|
# pip install -U transformers torch
|
|
|
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
model_name = "anarlavrenov/lime-1b-instruct" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
def build_prompt(question):
    # Wrap the question in the SFT prompt schema: <user> question <assistant>
    uid = "<user>"
    aid = "<assistant>"
    return uid + question + aid
|
|
|
|
|
question = "Write five questions for a Data Scientist interview." |
|
|
prompt = build_prompt(question) |
|
|
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
input_length = inputs['input_ids'].shape[1] |
|
|
|
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=128, |
|
|
num_beams=4, |
|
|
early_stopping=True, |
|
|
repetition_penalty=1.15, |
|
|
no_repeat_ngram_size=3, |
|
|
min_new_tokens=16, |
|
|
do_sample=False, |
|
|
top_p=None, |
|
|
temperature=None, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id, |
|
|
) |
|
|
|
|
|
generated_tokens = outputs[0][input_length:] |
|
|
output = tokenizer.decode(generated_tokens, skip_special_tokens=True) |
|
|
|
|
|
print(output) |
|
|
|
|
|
# 1. Can you tell us about your experience with data analysis and modeling? |
|
|
# 2. How do you approach data cleaning and preprocessing? |
|
|
# 3. How do you approach data visualization and storytelling? |
|
|
# 4. Can you walk us through a time when you used data to solve a problem? |
|
|
# 5. How do you approach the ethical considerations of data science and machine learning? |
|
|
|
|
|
``` |
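
For RAG-style use (context + question → answer, as listed under intended uses), the retrieved context can be placed inside the `<user>` turn. The exact context formatting used during fine-tuning is not documented here, so treat the helper below as a hypothetical template rather than a prescribed format:

```python
def build_rag_prompt(context: str, question: str) -> str:
    # Hypothetical layout: retrieved context and question share the <user> turn.
    return "<user>" + "Context: " + context + "\n\nQuestion: " + question + "<assistant>"

rag_prompt = build_rag_prompt(
    "LIME-1B is a 1B-parameter decoder-only Transformer trained on English web data.",
    "How many parameters does LIME-1B have?",
)
# Tokenize `rag_prompt` and call model.generate exactly as in the example above.
```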
|
|
|
|
|
If you use LIME-1B in academic work or public products, please consider citing the model and the underlying datasets according to their respective licenses and documentation. |
|
|
|
|
|
**Anar Lavrenov** |
|
|
|
|
|
[LinkedIn](https://www.linkedin.com/in/anar-lavrenov/)
|
|
|
|
|
Feel free to reach out with questions or feedback about LIME-1B!
|
|
|
|
|
## 6. Citation
|
|
```bibtex |
|
|
@misc{lime1b2025, |
|
|
title = {LIME-1B: A 1B-parameter English Causal Language Model}, |
|
|
author = {Anar Lavrenov}, |
|
|
year = {2025}, |
|
|
howpublished = {\url{https://huggingface.co/anarlavrenov/LIME-1B}} |
|
|
} |
|
|
``` |