Commit cca0510 · Parent: 831952d

Update model architecture

README.md CHANGED

@@ -65,7 +65,7 @@ for seq in sequences:
 
 ### Direct Use
 
-Research on large language models,
+Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).
 
 ### Out-of-Scope Use
 

@@ -127,13 +127,16 @@ Falcon-RW-1B was trained on 32 A100 40GB GPUs, using only data parallelism with
 
 Hyperparameters were adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)).
 
-
-
-
-
-
+| **Hyperparameter** | **Value**  | **Comment**                               |
+|--------------------|------------|-------------------------------------------|
+| Precision          | `bfloat16` |                                           |
+| Optimizer          | AdamW      |                                           |
+| Learning rate      | 2e-4       | 500M tokens warm-up, cosine decay to 2e-5 |
+| Weight decay       | 1e-1       |                                           |
+| Batch size         | 512        | 4B tokens ramp-up                         |
 
-
+
+#### Speeds, Sizes, Times
 
 Training happened in early December 2022 and took about six days.
 

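To make the schedule in the added table concrete: at batch size 512 and sequence length 2048, one optimizer step consumes 512 × 2048 ≈ 1.05M tokens, so the 500M-token warm-up corresponds to roughly 477 steps. Below is a minimal PyTorch sketch of the tabled optimizer settings; it is an illustration, not the actual training code, and the stand-in model and total step count are placeholders.

```python
import math

import torch

# AdamW with the tabled settings: lr 2e-4, weight decay 1e-1 (toy model only).
model = torch.nn.Linear(2048, 2048)  # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-1)

tokens_per_step = 512 * 2048                   # batch size x sequence length
warmup_steps = 500_000_000 // tokens_per_step  # 500M-token warm-up -> 476 steps
total_steps = 100_000                          # assumed horizon, illustration only

def lr_mult(step: int) -> float:
    """Linear warm-up, then cosine decay from 1.0x to 0.1x (2e-4 -> 2e-5)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    t = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * t))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_mult)
```

The `bfloat16` row concerns the numeric precision of the forward/backward pass (e.g., via `torch.autocast`) and is independent of the schedule above.
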
@@ -149,6 +152,16 @@ Training happened in early December 2022 and took about six days.
 
 Falcon-RW-1B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
 
+The architecture is adapted from the GPT-3 paper ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), but uses ALiBi ([Press et al., 2021](https://arxiv.org/abs/2108.12409)) and FlashAttention ([Dao et al., 2022](https://arxiv.org/abs/2205.14135)).
+
+| **Hyperparameter** | **Value** | **Comment**                             |
+|--------------------|-----------|-----------------------------------------|
+| Layers             | 24        |                                         |
+| `d_model`          | 2048      |                                         |
+| `head_dim`         | 64        | Reduced to optimise for FlashAttention  |
+| Vocabulary         | 50304     |                                         |
+| Sequence length    | 2048      |                                         |
+
 ### Compute Infrastructure
 
 #### Hardware

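On the objective stated in this hunk: a causal decoder-only model is trained so that the logits at position t are scored against the token at position t + 1. A toy illustration with random tensors, reusing the vocabulary size and sequence length from the tables:

```python
import torch
import torch.nn.functional as F

# Toy tensors only: (batch, seq, vocab) logits scored against inputs shifted by one.
logits = torch.randn(2, 2048, 50304)         # what a decoder-only model emits
tokens = torch.randint(0, 50304, (2, 2048))  # the input token ids
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 50304),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),          # targets are the next tokens 1..T-1
)
```
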
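On the ALiBi substitution: instead of adding position embeddings, each attention head is assigned a fixed slope, and attention logits are penalized in proportion to the distance between query and key. A short sketch of the standard formulation from Press et al. (2021); the head count of 32 is implied by the table (`d_model` 2048 / `head_dim` 64 = 32 heads):

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence from the paper (for power-of-two head counts):
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]  # entry (i, j) = j - i, <= 0 for past tokens
    return alibi_slopes(n_heads)[:, None, None] * distance  # (heads, seq, seq)

# Added to the pre-softmax attention logits; future positions (j > i) are
# removed by the causal mask, so their positive bias values never matter.
bias = alibi_bias(n_heads=32, seq_len=2048)
```
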
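Assuming the published checkpoint is the Hugging Face repo `tiiuae/falcon-rw-1b` (the repo id is not stated in this diff) and a transformers release that ships the Falcon architecture, the architecture table can be cross-checked against the model config:

```python
from transformers import AutoConfig

# Repo id is assumed, not taken from this diff.
cfg = AutoConfig.from_pretrained("tiiuae/falcon-rw-1b")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.vocab_size)
# Per the table above, this should print: 24 2048 50304
```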