DDP error when multi-gpu finetuning with speech encoder parameters unfrozen
I get the following error when finetuning on multiple GPUs with DDP, even though I have set ddp_find_unused_parameters=True:
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 875 with name model.embed_tokens_extend.audio_embed.encoder.encoders.23._checkpoint_wrapped_module.layer_norm.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
I experienced a similar error; this is my setup:
import os
import multiprocessing as mp

from accelerate import Accelerator
from transformers import Trainer, TrainingArguments

# parse_args, load_data, load_model_processor, unfreeze_speech_components,
# DatasetProcessor, collate_fn, the callbacks and logger are defined elsewhere.

def main():
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    args = parse_args()
    accelerator = Accelerator()

    logger.info("Loading datasets")
    dataset = load_data(datasets_paths=args.datasets_paths, sample_count=None)

    with accelerator.local_main_process_first():
        logger.info("Loading model and processor")
        model, processor = load_model_processor(args.model)
        model = unfreeze_speech_components(model)

        # Verify unfrozen parameters
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        logger.info(f"Trainable parameters: {trainable_params:,}")
        logger.info("Unfrozen components:")

        # After unfreezing
        encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
        proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())
        assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
        assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
        logger.info("Components properly unfrozen")

    logger.info("Processing dataset")
    train_processed_dataset = DatasetProcessor(
        split="train",
        dataset=dataset["train"],
        processor=processor,
    )
    validation_processed_dataset = DatasetProcessor(
        split="validation",
        dataset=dataset["validation"],
        processor=processor,
    )

    num_cpus = mp.cpu_count()

    try:
        training_args = TrainingArguments(
            ddp_find_unused_parameters=True,
            num_train_epochs=args.epochs,
            per_device_train_batch_size=args.train_batch_size,
            per_device_eval_batch_size=args.eval_batch_size,
            gradient_checkpointing=True,
            gradient_checkpointing_kwargs={'use_reentrant': False},
            gradient_accumulation_steps=args.gradient_accumulation_steps,
            optim='adamw_torch',
            adam_beta1=0.9,
            adam_beta2=0.95,
            adam_epsilon=1e-7,
            learning_rate=4.0e-5,
            weight_decay=args.weight_decay,
            max_grad_norm=1.0,
            lr_scheduler_type='linear',
            warmup_steps=args.num_warmup_steps,
            logging_steps=50,
            output_dir=os.path.join(args.output_dir, 'checkpoints'),
            save_total_limit=10,
            save_only_model=True,
            remove_unused_columns=False,
            report_to='none',
            deepspeed=None,
            dataloader_num_workers=num_cpus - 4,
            save_strategy="epoch",
        )

        trainer = Trainer(
            model=model,
            args=training_args,
            data_collator=collate_fn,
            train_dataset=train_processed_dataset,
        )

        save_processor_callback = SaveProcessorCallback(processor, accelerator, trainer)
        logging_callback = LoggingCallback()
        evaluation_callback = EvaluationCallback(model, processor, validation_processed_dataset, training_args)
        trainer.add_callback(save_processor_callback)
        trainer.add_callback(logging_callback)
        trainer.add_callback(evaluation_callback)

        logger.info("Starting training...")
        trainer.train()
        logger.info("Training completed successfully")
    except Exception as e:
        logger.error(f"Training failed: {e}")
        raise e
I had the same problem. Replacing my training setup with the Docker one at https://github.com/anastasiosyal/phi4-multimodal-instruct-server/blob/main/dockerfile helped.
@Andrey What do you mean by "replacing the training setup with the Docker one"? Could you please explain how you trained with multiple GPUs?
The error occurs because the model code internally wraps the audio encoder layers with activation checkpointing (note the _checkpoint_wrapped_module in the parameter name from the traceback), and the re-entrant checkpoint implementation runs extra backward passes over those layers. DDP expects each parameter's gradient hook to fire only once per iteration, but the re-entrant checkpointing fires it multiple times, which triggers the "marked as ready twice" error.
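To confirm which submodules are wrapped before applying the fix, you can scan the model for the _checkpoint_wrapped_module attribute that appears in the parameter name from the traceback. This is just a small diagnostic helper I am sketching here, not part of any library API:

def list_checkpoint_wrapped(model):
    # Print every submodule that carries the checkpoint wrapper attribute
    # seen in the traceback (...encoders.23._checkpoint_wrapped_module...).
    for name, module in model.named_modules():
        if hasattr(module, "_checkpoint_wrapped_module"):
            print(f"checkpoint-wrapped: {name} -> {type(module).__name__}")

list_checkpoint_wrapped(model)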
Fix:
Disable internal checkpointing by replacing the wrapper with the raw module:
def disable_internal_checkpointing(model):
    # Replace each checkpoint wrapper in the audio encoder with the module it wraps
    for i, layer in enumerate(model.model.embed_tokens_extend.audio_embed.encoder.encoders):
        if hasattr(layer, "_checkpoint_wrapped_module"):
            model.model.embed_tokens_extend.audio_embed.encoder.encoders[i] = layer._checkpoint_wrapped_module
    return model

# Apply after loading the model
model = disable_internal_checkpointing(model)
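If you would rather keep activation checkpointing on those layers for memory reasons, a possible alternative is to re-wrap them with the non-reentrant implementation, which does not trigger the double-ready issue in DDP. This is only a sketch, assuming a recent PyTorch and the same attribute path as above; checkpoint_wrapper lives in a private torch.distributed module:

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    checkpoint_wrapper,
)

def rewrap_non_reentrant(model):
    # Sketch: swap each re-entrant checkpoint wrapper for a non-reentrant one,
    # keeping the memory savings of activation checkpointing while staying DDP-friendly.
    encoders = model.model.embed_tokens_extend.audio_embed.encoder.encoders
    for i, layer in enumerate(encoders):
        if hasattr(layer, "_checkpoint_wrapped_module"):
            encoders[i] = checkpoint_wrapper(
                layer._checkpoint_wrapped_module,
                checkpoint_impl=CheckpointImpl.NO_REENTRANT,
            )
    return model

model = rewrap_non_reentrant(model)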
PS: make sure ddp_find_unused_parameters is set to True
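Another option, which the traceback itself mentions, is DDP's static-graph mode. I have not verified it with Trainer (which builds its own Accelerator internally), but if you drive the training loop with Accelerate directly, a minimal sketch would be:

from accelerate import Accelerator, DistributedDataParallelKwargs

# Sketch: static_graph=True tells DDP that the autograd graph does not change
# across iterations, which is what lets it handle re-entrant checkpointing
# where a parameter's hook can fire more than once per step.
ddp_kwargs = DistributedDataParallelKwargs(static_graph=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)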
@ukemamaster , @rumourscape , let me know if this works for you