Troubleshooting Interconnect: Share Your Experience
#1 by nouamanetazi - opened
Hi everyone! 👋
The Troubleshooting Interconnect section has some initial findings on common NCCL performance issues (CPU affinity, network topology, environment variables, container configs).
Read the full section here: https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#troubleshooting-interconnect
We'd love to make this a living community resource! Have you run into:
- NCCL performance bottlenecks that were tricky to debug?
- Cloud-specific networking issues (AWS EFA, GCP, Azure)?
- Container configuration gotchas?
- Effective debugging workflows or tools?
Share your troubleshooting stories, solutions, or questions below.
Your experience could help others avoid hours of debugging! 🤗