---

## **Abstract**

This repository provides a domain-adapted Turkish legal instruction-tuned model derived from Meta’s Llama-3.1-8B-Instruct. As part of the “Harnessing Fully Sharded Data Parallelism v2 with Float8 Precision for Faster Training” study, this configuration represents the BF16 variant, using the default **Tensorwise** quantization scaling recipe, trained on 4 nodes with a global batch size of 32.

In this scaling regime, FP8 mixed precision did not yield a runtime improvement over BF16, highlighting how FP8 efficiency varies with batch size, sequence parallelism, and multi-node communication overhead. This model provides a strong BF16 baseline for comparison across all batch-size and node-scaling experiments in the study.

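
For quick reference, the snippet below is a minimal inference sketch using the standard `transformers` chat-template flow; the repository id is a placeholder and should be replaced with this model's actual Hugging Face Hub path.

```python
# Minimal inference sketch (illustrative usage, not an official example).
# Replace the placeholder repo id with this model's actual Hub path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<this-model-repo-id>"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Example Turkish legal question ("What conditions are required to terminate a lease agreement?")
messages = [{"role": "user", "content": "Kira sözleşmesinin feshi için hangi koşullar gereklidir?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
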
## **Experiment Context**

This model was trained as part of our study comparing **FSDP2 with bfloat16 precision** against **FSDP2 with FP8 mixed precision (bf16-fp8)**.
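
For orientation, here is a minimal sketch of how the two precision configurations can be set up with PyTorch FSDP2 and torchao's float8 training utilities; it assumes PyTorch >= 2.6 and an initialized distributed process group, and it is not the exact training script used in the study.

```python
# Minimal sketch of the two precision configurations compared in the study
# (assumes PyTorch >= 2.6, torchao, and a distributed process group initialized
#  e.g. via torchrun; illustrative only, not the study's actual training script).
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

use_fp8 = False  # False = the plain BF16 baseline described in this card
if use_fp8:
    # Swap nn.Linear layers for float8 training; Float8LinearConfig() defaults
    # to tensorwise scaling. The lm_head is kept in high precision.
    convert_to_float8_training(
        model,
        config=Float8LinearConfig(),
        module_filter_fn=lambda module, fqn: "lm_head" not in fqn,
    )

# FSDP2: shard each transformer block, then the root module, with bf16 params
# and fp32 gradient reduction.
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for layer in model.model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)
```

In this sketch, the BF16 baseline corresponds to `use_fp8 = False`; the FP8 comparison flips only that flag.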