+---
+license: apache-2.0
+tags:
+- ipt
+- alibi
+inference: false
+datasets:
+- oscar-corpus/OSCAR-2301
+language:
+- it
+---
+# ipt-350m
+ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text.
+It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)).
+ipt-350m is:
+- **Licensed for commercial use** (Apache-2.0)
+- **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409).
+- **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer))
+- **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)
+## How to Use
+```python
+import transformers
+model = transformers.AutoModelForCausalLM.from_pretrained(
+  'efederici/ipt-350m-alibi',
+  trust_remote_code=True
+)
+```
+Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
+To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
+```python
+import torch
+import transformers
+name = 'efederici/ipt-350m-alibi'
+config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
+config.attn_config['attn_impl'] = 'triton'
+config.init_device = 'cuda:0'
+model = transformers.AutoModelForCausalLM.from_pretrained(
+  name,
+  config=config,
+  torch_dtype=torch.bfloat16,
+  trust_remote_code=True
+)
+```
+Although the model was trained with a sequence length of 2048, ALiBi makes it possible to increase the maximum sequence length during finetuning and/or inference.
+```python
+import transformers
+name = 'efederici/ipt-350m-alibi'
+config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
+config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
+model = transformers.AutoModelForCausalLM.from_pretrained(
+  name,
+  config=config,
+  trust_remote_code=True
+)
+```
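ALiBi tolerates a longer `max_seq_len` because it adds a distance-proportional penalty to the attention scores instead of learning one embedding per position. The following is a minimal pure-Python sketch of that bias, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count (16 heads here). The helper names `alibi_slopes` and `alibi_bias` are illustrative and are not taken from this model's code.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8 / n_heads).

    Assumes n_heads is a power of two, as in the basic recipe of the ALiBi paper.
    """
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias for one head.

    Keys at or before the query get -slope * (query_pos - key_pos);
    keys after the query are masked with -inf.
    """
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(16)
bias_short = alibi_bias(slopes[0], 4)  # bias for a 4-token window
bias_long = alibi_bias(slopes[0], 8)   # same head, 8-token window
```

The bias depends only on the distance between query and key, so the top-left corner of the longer matrix equals the shorter one, and extending the window needs no new parameters.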
+## Model Description
+The architecture is a standard decoder-only transformer modified in the following ways:
+- It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
+- It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings
+- It does not use biases
+| Hyperparameter | Value |
+|----------------|-------|
+| n_parameters | 350M |
+| n_layers | 24 |
+| n_heads | 16 |
+| d_model | 1024 |
+| vocab size | 50432 |
+| sequence length | 2048 |
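As a rough sanity check, the hyperparameters in the table can be turned into a parameter estimate. This is a sketch under assumptions the card does not state: an MLP expansion ratio of 4, input embeddings tied to the output layer, and two bias-free LayerNorms per block plus a final one. The function `estimate_params` is a hypothetical helper.

```python
def estimate_params(d_model, n_layers, vocab_size, mlp_ratio=4):
    """Estimate the parameter count of a bias-free decoder-only transformer."""
    embedding = vocab_size * d_model         # token embeddings (assumed tied with the output layer)
    attention = 4 * d_model * d_model        # Q, K, V and output projections, no biases
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection, no biases (assumed ratio)
    norms = 2 * d_model                      # two LayerNorm weight vectors per block (assumed)
    final_norm = d_model                     # final LayerNorm weight vector (assumed)
    return embedding + n_layers * (attention + mlp + norms) + final_norm


total = estimate_params(d_model=1024, n_layers=24, vocab_size=50432)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the nominal 350M figure, with ALiBi contributing no learned position parameters.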
+### Dataset
+The model was trained on ~13B tokens (with batch size 64 and sequence length 2048) of [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301).
+Each training example was built by packing as many sequences from that dataset as were needed to fill the 2048-token sequence length.
+The vocabulary size is 50432, a multiple of 128 as suggested in [Megatron-LM](https://arxiv.org/abs/1909.08053); this increased model FLOPs utilization (MFU) by up to four percentage points.
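The packing step mentioned above (filling each fixed-length example from consecutive sequences) can be sketched as follows. This is an illustration, not the actual preprocessing code: it assumes documents are joined with an end-of-sequence token and that a trailing partial example is dropped, neither of which the card confirms. `pack_sequences` and the toy token ids are hypothetical.

```python
def pack_sequences(tokenized_docs, seq_len, eos_id):
    """Concatenate tokenized documents and yield fixed-length examples."""
    buffer = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # assumed separator between documents
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover tokens shorter than seq_len are dropped (an assumption).


# Toy data: three "documents" packed into examples of length 4.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
examples = list(pack_sequences(docs, seq_len=4, eos_id=0))
```

With `seq_len=2048` the same loop would produce examples of the length used for training; every example is full, so no padding tokens are needed.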
+If you like this project, consider supporting me with a cup of coffee! 🤖✨🌞
+[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici)