DDP error when multi-gpu finetuning with speech encoder parameters unfrozen
I get the following error when finetuning on multiple GPUs with DDP, even though I have set ddp_find_unused_parameters=True:
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 875 with name model.embed_tokens_extend.audio_embed.encoder.encoders.23._checkpoint_wrapped_module.layer_norm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
I experienced a similar error; this is my setup:
import os
import multiprocessing as mp

from accelerate import Accelerator
from transformers import Trainer, TrainingArguments

# parse_args, load_data, load_model_processor, unfreeze_speech_components,
# DatasetProcessor, collate_fn, the callbacks and logger are defined elsewhere.

def main():
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    args = parse_args()
    accelerator = Accelerator()

    logger.info("Loading datasets")
    dataset = load_data(datasets_paths=args.datasets_paths, sample_count=None)

    with accelerator.local_main_process_first():
        logger.info("Loading model and processor")
        model, processor = load_model_processor(args.model)
        model = unfreeze_speech_components(model)

        # Verify unfrozen parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        logger.info(f"Trainable parameters: {trainable_params:,}")
        logger.info("Unfrozen components:")

        # After unfreezing
        encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
        proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())
        assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
        assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
        logger.info("Components properly unfrozen")

    logger.info("Processing dataset")
    train_processed_dataset = DatasetProcessor(
        split="train",
        dataset=dataset["train"],
        processor=processor,
    )
    validation_processed_dataset = DatasetProcessor(
        split="validation",
        dataset=dataset["validation"],
        processor=processor,
    )

    num_cpus = mp.cpu_count()

    try:
        training_args = TrainingArguments(
            ddp_find_unused_parameters=True,
            num_train_epochs=args.epochs,
            per_device_train_batch_size=args.train_batch_size,
            per_device_eval_batch_size=args.eval_batch_size,
            gradient_checkpointing=True,
            gradient_checkpointing_kwargs={'use_reentrant': False},
            gradient_accumulation_steps=args.gradient_accumulation_steps,
            optim='adamw_torch',
            adam_beta1=0.9,
            adam_beta2=0.95,
            adam_epsilon=1e-7,
            learning_rate=4.0e-5,
            weight_decay=args.weight_decay,
            max_grad_norm=1.0,
            lr_scheduler_type='linear',
            warmup_steps=args.num_warmup_steps,
            logging_steps=50,
            output_dir=os.path.join(args.output_dir, 'checkpoints'),
            save_total_limit=10,
            save_only_model=True,
            remove_unused_columns=False,
            report_to='none',
            deepspeed=None,
            dataloader_num_workers=num_cpus - 4,
            save_strategy="epoch",
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=collate_fn,
            train_dataset=train_processed_dataset,
        )

        save_processor_callback = SaveProcessorCallback(processor, accelerator, trainer)
        logging_callback = LoggingCallback()
        evaluation_callback = EvaluationCallback(model, processor, validation_processed_dataset, training_args)
        trainer.add_callback(save_processor_callback)
        trainer.add_callback(logging_callback)
        trainer.add_callback(evaluation_callback)

        logger.info("Starting training...")
        trainer.train()
        logger.info("Training completed successfully")
    except Exception as e:
        logger.error(f"Training failed: {e}")
        raise e
I had the same problem. Replacing my training setup with the Docker one at https://github.com/anastasiosyal/phi4-multimodal-instruct-server/blob/main/dockerfile helped.
@Andrey What do you mean by "replacing the training setup with the Docker one"? Could you please explain how you trained with multiple GPUs?
The error occurs because the model code internally wraps the audio encoder layers with activation checkpointing (note the _checkpoint_wrapped_module in the parameter name from the traceback), and the re-entrant checkpoint implementation runs extra backward passes over those layers. DDP expects each parameter's gradient hook to fire only once per iteration, but the re-entrant checkpointing fires it multiple times, which triggers the "marked as ready twice" error.
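To confirm which submodules are wrapped before applying the fix, you can scan the model for the _checkpoint_wrapped_module attribute that appears in the parameter name from the traceback. This is just a small diagnostic helper I am sketching here, not part of any library API:

def list_checkpoint_wrapped(model):
    # Print every submodule that carries the checkpoint wrapper attribute
    # seen in the traceback (...encoders.23._checkpoint_wrapped_module...).
    for name, module in model.named_modules():
        if hasattr(module, "_checkpoint_wrapped_module"):
            print(f"checkpoint-wrapped: {name} -> {type(module).__name__}")

list_checkpoint_wrapped(model)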
Fix:
Disable internal checkpointing by replacing the wrapper with the raw module:
def disable_internal_checkpointing(model):
    # Replace each checkpoint wrapper in the audio encoder with the module it wraps
    for i, layer in enumerate(model.model.embed_tokens_extend.audio_embed.encoder.encoders):
        if hasattr(layer, "_checkpoint_wrapped_module"):
            model.model.embed_tokens_extend.audio_embed.encoder.encoders[i] = layer._checkpoint_wrapped_module
    return model

# Apply after loading the model
model = disable_internal_checkpointing(model)
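If you would rather keep activation checkpointing on those layers for memory reasons, a possible alternative is to re-wrap them with the non-reentrant implementation, which does not trigger the double-ready issue in DDP. This is only a sketch, assuming a recent PyTorch and the same attribute path as above; checkpoint_wrapper lives in a private torch.distributed module:

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

def rewrap_non_reentrant(model):
    # Sketch: swap each re-entrant checkpoint wrapper for a non-reentrant one,
    # keeping the memory savings of activation checkpointing while staying DDP-friendly.
    encoders = model.model.embed_tokens_extend.audio_embed.encoder.encoders
    for i, layer in enumerate(encoders):
        if hasattr(layer, "_checkpoint_wrapped_module"):
            encoders[i] = checkpoint_wrapper(
                layer._checkpoint_wrapped_module,
                checkpoint_impl=CheckpointImpl.NO_REENTRANT,
            )
    return model

model = rewrap_non_reentrant(model)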
PS: make sure ddp_find_unused_parameters is set to True
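Another option, which the traceback itself mentions, is DDP's static-graph mode. I have not verified it with Trainer (which builds its own Accelerator internally), but if you drive the training loop with Accelerate directly, a minimal sketch would be:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Sketch: static_graph=True tells DDP that the autograd graph does not change
# across iterations, which is what lets it handle re-entrant checkpointing
# where a parameter's hook can fire more than once per step.
ddp_kwargs = DistributedDataParallelKwargs(static_graph=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)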
@ukemamaster , @rumourscape , let me know if this works for you