arXiv:2510.00606

ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training

Published on Oct 1, 2025
Abstract

AI-generated summary: ElasWave is an elastic-native training system that achieves parameter consistency, low MTTR, high post-change throughput, and computation consistency in large-scale LLM pretraining through multi-dimensional scheduling and dynamic communication.

Large-scale LLM pretraining now runs across 10^5–10^6 accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG state to preserve computation consistency. A dynamic communicator enables in-place communication-group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs against state-of-the-art baselines: throughput improves by 1.35× over ReCycle and 1.60× over TorchFT; communicator recovery completes within one second (up to 82×/3.6× faster than full/partial rebuilds); migration MTTR drops by as much as 51%; and convergence deviation is reduced by approximately 78%.
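The micro-batch resharding idea in the abstract can be illustrated with a short sketch. The Python below is a hypothetical illustration, not ElasWave's implementation: the names reshard_microbatches and RankPlan are invented for this example, and it shows only the arithmetic of keeping the global batch size (and hence the gradient scale, for a mean-reduced loss) fixed while redistributing micro-batches over the surviving data-parallel ranks.

    # Hypothetical sketch (not ElasWave's code): after a rank failure,
    # redistribute a fixed global batch over the surviving data-parallel
    # ranks. The global batch size never changes, so the effective
    # gradient scale of a mean-reduced loss is unchanged by the event.
    from dataclasses import dataclass

    @dataclass
    class RankPlan:
        rank: int
        num_microbatches: int
        microbatch_size: int

    def reshard_microbatches(global_batch_size: int,
                             microbatch_size: int,
                             surviving_ranks: list) -> list:
        assert global_batch_size % microbatch_size == 0, \
            "global batch must tile evenly into micro-batches"
        total_mb = global_batch_size // microbatch_size
        base, extra = divmod(total_mb, len(surviving_ranks))
        # The first `extra` ranks each absorb one additional micro-batch,
        # so every micro-batch is reassigned and none is dropped.
        return [RankPlan(r, base + (1 if i < extra else 0), microbatch_size)
                for i, r in enumerate(surviving_ranks)]

    # Example: 512-sample global batch, micro-batch size 8, rank 6 failed.
    # The 64 micro-batches are spread over the 7 remaining ranks (10 + 6*9).
    for plan in reshard_microbatches(512, 8, [0, 1, 2, 3, 4, 5, 7]):
        print(plan)

Holding the global batch size constant in this way means the optimizer sees the same effective learning-rate schedule and gradient statistics before and after the elastic event, which is what allows recovery without perturbing convergence.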

