arxiv:2510.00606

ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training

Published on Oct 1

Authors:

Abstract

ElasWave is an elastic-native training system that achieves parameter consistency, low MTTR, high post-change throughput, and computation consistency in large-scale LLM pretraining by using multi-dimensional scheduling and dynamic communication.

AI-generated summary

Large-scale LLM pretraining now runs across 10^5--10^6 accelerators, making failures routine and elasticity mandatory. We posit that an elastic-native training system must jointly deliver (i) parameter consistency, (ii) low mean time to recovery (MTTR), (iii) high post-change throughput, and (iv) computation consistency. No prior system achieves all four simultaneously. To achieve these goals, we present ElasWave, which delivers per-step fault tolerance via multi-dimensional scheduling across graph, dataflow, DVFS, and RNG. ElasWave reshapes and reshards micro-batches while preserving the global batch size and gradient scale. It performs online pipeline resharding with asynchronous parameter migration and interleaves ZeRO partitions, reducing parameter recovery processes to disjoint rank-to-rank transfers. It further leverages DVFS to absorb pipeline bubbles and reshards RNG to keep computation consistency. Together, a dynamic communicator enables in-place communication group edits, while per-step in-memory snapshots support online verification and redistribution. We evaluate ElasWave on 96 NPUs and benchmark it against state-of-the-art baselines: throughput improves by 1.35times over ReCycle and 1.60times over TorchFT; communicator recovery completes within one second (up to 82times/3.6times faster than full/partial rebuilds); migration MTTR drops by as much as 51%; and convergence deviation is reduced by approximately 78%.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.00606 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.00606 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.00606 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.