# HIP Grouped GEMM Status (2025-09-18)

## Current toggle

- Set `MEGABLOCKS_GG_USE_HIPBLASLT=1` to force the ROCm build to run the hipBLASLt backend instead of the FP32 fallback in `hipblaslt_gmm_internal`.
- Without the flag, the code uses the stable FP32 `torch::matmul` path, which overwrites the destination buffer. A sketch of the gate follows this list.
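
A minimal sketch of how the gate can be read on the C++ side (the helper name is hypothetical; the actual dispatch in `hipblaslt_gmm_internal` may differ):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when MEGABLOCKS_GG_USE_HIPBLASLT=1.
// Cached so the environment is consulted only once per process.
static bool use_hipblaslt_backend() {
  static const bool enabled = [] {
    const char* v = std::getenv("MEGABLOCKS_GG_USE_HIPBLASLT");
    return v != nullptr && std::strcmp(v, "1") == 0;
  }();
  return enabled;
}
```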

## What works with hipBLASLt enabled

- `_dev/debug-gg-small.py`, `_dev/debug-tensor-copy.py`, and `_dev/debug-gg-detailed.py` finish with finite outputs; differences stay within roughly 1e-3 to 1e-2, consistent with BF16 precision.
- `python -m pytest tests/test_gg.py -q` passes with the flag set.

## Known failures

- `PYTHONPATH=build/... MEGABLOCKS_GG_USE_HIPBLASLT=1 python -m pytest tests/ops_test.py -q` aborts with a HIP memory access fault ("Memory access fault by GPU node-2") during `OpsTest.testGroupedGemm_FixedSizes`.
- The same failure occurs early when the suite is run via `run-tests.sh`, so hipBLASLt is not yet production-ready.

## Next steps

- Reproduce the fault in isolation (likely the large `(z=16, m=128, k=128, n=128)` cases) and inspect the arguments passed into `hipblaslt_run_matmul` (leading dimensions/layout); see the logging sketch after this list.
- Investigate whether hipBLASLt requires column-major layouts or a non-zero workspace to handle the grouped GEMM shapes; the operand-swap sketch below shows the usual column-major workaround.
- Consider a hybrid strategy: attempt hipBLASLt per expert and fall back to FP32 for shapes that exceed stability thresholds (e.g., by catching `hipblaslt_run_matmul` errors once we can reliably detect them); see the fallback sketch below.
- Once hipBLASLt is stable, tighten the tolerances and gradient checks in `tests/test_gg.py` and re-enable the high-performance path by default.
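
For the reproduction step, a hedged sketch of per-expert argument logging ahead of the hipBLASLt call (`log_gemm_args` and its parameter list are illustrative, not the internal API):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative: dump the shape/stride arguments for one expert so the
// faulting case (e.g. z=16, m=128, k=128, n=128) can be diffed against
// a case that is known to succeed.
void log_gemm_args(int expert, int64_t m, int64_t n, int64_t k,
                   int64_t lda, int64_t ldb, int64_t ldc) {
  std::fprintf(stderr,
               "[gg] expert=%d m=%lld n=%lld k=%lld lda=%lld ldb=%lld ldc=%lld\n",
               expert, (long long)m, (long long)n, (long long)k,
               (long long)lda, (long long)ldb, (long long)ldc);
}
```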
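
If column-major layouts turn out to be required, the standard zero-copy workaround is to compute the transposed product with swapped operands: a row-major C = A·B occupies the same memory as a column-major Cᵀ = Bᵀ·Aᵀ. A sketch, assuming a simple argument bundle (`GemmArgs` and `to_column_major` are hypothetical):

```cpp
#include <cstdint>

// Hypothetical bundle of the arguments handed to hipblaslt_run_matmul.
struct GemmArgs {
  int64_t m, n, k;       // C is m x n, A is m x k, B is k x n
  int64_t lda, ldb, ldc; // leading dimensions
  const void *a, *b;
  void *c;
};

// Reinterpret a row-major GEMM as a column-major one without copying:
// row-major C(m x n) = A(m x k) * B(k x n) is the same memory as
// column-major C^T(n x m) = B^T(n x k) * A^T(k x m), so swap the
// operands and the m/n dimensions and reuse the leading dimensions.
GemmArgs to_column_major(const GemmArgs& rm) {
  return GemmArgs{/*m=*/rm.n, /*n=*/rm.m, /*k=*/rm.k,
                  /*lda=*/rm.ldb, /*ldb=*/rm.lda, /*ldc=*/rm.ldc,
                  /*a=*/rm.b, /*b=*/rm.a, /*c=*/rm.c};
}
```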
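
And a sketch of the hybrid fallback, assuming hipBLASLt failures become detectable as exceptions (`hipblaslt_matmul_or_throw` is a hypothetical wrapper; reliable error detection is exactly the open question above):

```cpp
#include <exception>
#include <torch/torch.h>

// Hypothetical wrapper around hipblaslt_run_matmul that throws on failure.
void hipblaslt_matmul_or_throw(const torch::Tensor& a,
                               const torch::Tensor& b,
                               torch::Tensor& out);

// Per-expert dispatch: try the fast hipBLASLt path first, then fall back
// to the known-good FP32 torch::matmul path, overwriting the destination
// just like the default path described above.
void grouped_matmul_hybrid(const torch::Tensor& a,  // one expert's BF16 slice
                           const torch::Tensor& b,
                           torch::Tensor& out) {
  try {
    hipblaslt_matmul_or_throw(a, b, out);
  } catch (const std::exception&) {
    out.copy_(torch::matmul(a.to(torch::kFloat), b.to(torch::kFloat))
                  .to(out.scalar_type()));
  }
}
```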