A little guide to building Large Language Models in 2024
Resources mentioned by @thomwolf in https://x.com/Thom_Wolf/status/1773340316835131757
- Yi: Open Foundation Models by 01.AI (Paper • 2403.04652) - Note: check out their chat space: https://huggingface.co/spaces/01-ai/Yi-34B-Chat
- A Survey on Data Selection for Language Models (Paper • 2402.16827)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (Paper • 2402.00159) - Note: check out the OLMo suite: https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (Paper • 2306.01116) - Note: check out datatrove: https://github.com/huggingface/datatrove (freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks) (usage sketch below)
- Bag of Tricks for Efficient Text Classification (Paper • 1607.01759) - Note: read more: https://fasttext.cc/ (usage sketch below)
- Breadth-First Pipeline Parallelism (Paper • 2211.05953) - Note: check out nanotron: https://github.com/huggingface/nanotron (minimalistic large language model 3D-parallelism training)
- Reducing Activation Recomputation in Large Transformer Models (Paper • 2205.05198) (activation checkpointing sketch below)
- Sequence Parallelism: Long Sequence Training from System Perspective (Paper • 2105.13120)
- Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (Paper • 2203.03466) - Note: from the creators of Grok: https://huggingface.co/xai-org/grok-1
- Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster (Paper • 2304.03208)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper • 2312.00752) - Note: check out the transformers-compatible Mamba models: https://huggingface.co/collections/state-spaces/transformers-compatible-mamba-65e7b40ab87e5297e45ae406 (usage sketch below)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Paper • 2305.18290) - Note: check out TRL: https://huggingface.co/docs/trl (train transformer language models with reinforcement learning) (usage sketch below)
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (Paper • 2402.14740)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Paper • 2210.17323) - Note: read more: https://huggingface.co/blog/gptq-integration (Making LLMs lighter with AutoGPTQ and transformers) (usage sketch below)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Paper • 2208.07339) - Note: read more: https://huggingface.co/docs/bitsandbytes (accessible large language models via k-bit quantization for PyTorch) (usage sketch below)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (Paper • 2401.10774)
- Open LLM Leaderboard 🏆: track, rank and evaluate open LLMs and chatbots - Note: check out lighteval: https://github.com/huggingface/lighteval (LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and LLM training library nanotron)
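
Datatrove (RefinedWeb item above): a rough sketch of a local filtering pipeline built from datatrove's processing blocks. Class names, module paths and arguments are from memory of an early datatrove release and may differ in the current version; the input and output paths are made up.

```python
# Rough datatrove pipeline sketch: read JSONL shards, drop low-quality documents,
# write the survivors back out. Names/arguments may differ across datatrove versions.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import GopherQualityFilter
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/raw/"),        # read raw documents from JSONL shards (hypothetical path)
        GopherQualityFilter(),           # drop documents failing Gopher-style quality heuristics
        JsonlWriter("data/filtered/"),   # write the surviving documents (hypothetical path)
    ],
    tasks=4,     # number of shards to process
    workers=4,   # parallel local workers
)
executor.run()
```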
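
fastText (Bag of Tricks item above): a cheap supervised classifier is a common tool for quality-filtering pretraining data. A minimal sketch of the fasttext Python API; the training file and the `__label__` names are made up.

```python
# Minimal fastText supervised-classification sketch. fastText expects one
# "__label__X <text>" line per training example; labels here are made up.
import fasttext

# train.txt lines look like: "__label__hq <document text>" / "__label__lq <document text>"
model = fasttext.train_supervised(input="train.txt", lr=0.5, epoch=5, wordNgrams=2)

# Score a new document: returns the top label and its probability.
labels, probs = model.predict("an example document to score", k=1)
print(labels[0], probs[0])

model.save_model("quality_classifier.bin")
```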
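
Activation recomputation (Reducing Activation Recomputation item above): the paper proposes selective recomputation (together with sequence parallelism) rather than checkpointing whole blocks. For reference, this is plain PyTorch activation checkpointing, the generic baseline the paper improves on; the block is a toy example.

```python
# Plain (non-selective) activation checkpointing: activations inside the wrapped
# block are not stored and are recomputed during the backward pass, trading
# compute for memory. The paper refines this by recomputing only cheap activations.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Forward through the block without storing intermediate activations.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```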
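
Transformers-compatible Mamba (Mamba item above): the checkpoints in the linked collection load like any causal LM. The model id below is my assumption of one entry in that collection; check the collection page for the exact ids.

```python
# Loading a transformers-compatible Mamba checkpoint like any other causal LM.
# The checkpoint id is assumed; see the linked collection for the real ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-130m-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Mamba replaces attention with a selective state space layer", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```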
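
TRL / DPO (Direct Preference Optimization item above): a minimal DPO fine-tuning sketch. Argument names have shifted across TRL versions (for example `tokenizer` vs. `processing_class`, and `beta` on the trainer vs. on `DPOConfig`), and the model and dataset ids are assumptions; any preference dataset with prompt/chosen/rejected columns works.

```python
# Minimal DPO sketch with TRL. Treat names as version-dependent assumptions,
# not the canonical API; gpt2 stands in for your SFT checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "gpt2"  # stand-in policy model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Preference pairs with "prompt", "chosen", "rejected" columns (assumed dataset id).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions call this argument `tokenizer`
)
trainer.train()
```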
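
GPTQ via transformers (GPTQ item above): the blog post linked in the note describes quantizing at load time with a `GPTQConfig`. A sketch; the model id and calibration dataset are illustrative, and optimum plus an AutoGPTQ backend must be installed.

```python
# Post-training GPTQ quantization through the transformers integration:
# quantization happens at load time, layer by layer, using calibration data.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quant_config
)
quantized.save_pretrained("opt-125m-gptq")
```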
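
bitsandbytes / LLM.int8() (LLM.int8() item above): loading a model with 8-bit weights through the transformers integration. Requires a CUDA GPU with bitsandbytes installed; the model id is illustrative.

```python
# Load a causal LM with LLM.int8() 8-bit weights via bitsandbytes + transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # illustrative model
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=bnb_config
)

inputs = tokenizer("8-bit inference keeps outlier features in higher precision:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```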