Troubleshooting Interconnect: Share Your Experience

#1
by nouamanetazi - opened
Hugging Face Smol Models Research org

Hi everyone! 👋

The Troubleshooting Interconnect section has some initial findings on common NCCL performance issues (CPU affinity, network topology, environment variables, container configs).

[Read the full section here →] https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#troubleshooting-interconnect

We'd love to make this a living community resource! Have you run into:

  • NCCL performance bottlenecks that were tricky to debug?
  • Cloud-specific networking issues (AWS EFA, GCP, Azure)?
  • Container configuration gotchas?
  • Effective debugging workflows or tools?

Share your troubleshooting stories, solutions, or questions below.
Your experience could help others avoid hours of debugging! 🤗
