# HIP Grouped GEMM Status (2025-09-18)

## Current toggle

- Set `MEGABLOCKS_GG_USE_HIPBLASLT=1` to force the ROCm build to run the hipBLASLt backend instead of the FP32 fallback in `hipblaslt_gmm_internal`.
- Without the flag, the code uses the stable FP32 `torch::matmul` path that overwrites the destination buffer.

## What works with hipBLASLt enabled

- `_dev/debug-gg-small.py`, `_dev/debug-tensor-copy.py`, and `_dev/debug-gg-detailed.py` finish with finite outputs (differences are within ~1e-3..1e-2 due to BF16).
- `python -m pytest tests/test_gg.py -q` passes with the flag set.

## Known failures

- `PYTHONPATH=build/... MEGABLOCKS_GG_USE_HIPBLASLT=1 python -m pytest tests/ops_test.py -q` aborts with a HIP memory access fault (`Memory access fault by GPU node-2` during `OpsTest.testGroupedGemm_FixedSizes`).
- The same failure occurs early when the test suite is run via `run-tests.sh`, so hipBLASLt is not yet production-ready.

## Next steps

- Reproduce the fault in isolation (likely the large `(z=16, m=128, k=128, n=128)` cases) and inspect the arguments passed into `hipblaslt_run_matmul` (leading dimensions/layout); a standalone repro sketch is included at the end of this note.
- Investigate whether hipBLASLt requires column-major layouts or a non-zero workspace to handle the grouped GEMM shapes.
- Consider a hybrid strategy: attempt hipBLASLt per expert and fall back to FP32 for shapes that exceed stability thresholds (e.g., by catching `hipblaslt_run_matmul` errors once we can reliably detect them); see the fallback sketch below.
- Once hipBLASLt is stable, tighten tolerances/grad checks in `tests/test_gg.py` (see the tolerance example below) and re-enable the high-performance path by default.
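
## Appendix: illustrative sketches

These sketches are illustrative only and not part of the tree. First, a minimal standalone repro for the `ops_test.py` fault, assuming the public entry point is `grouped_gemm.ops.gmm` taking a `(tokens, k)` activation, a `(num_experts, k, n)` weight stack, and a CPU int64 `batch_sizes` tensor; adjust to whatever harness `OpsTest.testGroupedGemm_FixedSizes` actually uses.

```python
# Hypothetical standalone repro; the import path and argument layout are
# assumptions, not confirmed against tests/ops_test.py.
import os
os.environ.setdefault("MEGABLOCKS_GG_USE_HIPBLASLT", "1")

import torch
from grouped_gemm import ops  # assumed import path


def repro(z=16, m=128, k=128, n=128):
    # One (m, k) x (k, n) GEMM per expert, matching the suspect fixed-size case.
    a = torch.randn(z * m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(z, k, n, device="cuda", dtype=torch.bfloat16)
    batch_sizes = torch.full((z,), m, dtype=torch.int64)  # rows per expert, kept on CPU
    out = ops.gmm(a, b, batch_sizes)
    torch.cuda.synchronize()  # surface the async HIP fault here rather than later
    assert torch.isfinite(out).all()
    print("ok:", tuple(out.shape))


if __name__ == "__main__":
    repro()
```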
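
Second, a minimal sketch of the hybrid per-expert fallback, assuming hipBLASLt failures eventually surface as catchable `RuntimeError`s (today the memory fault aborts the process instead). `backend.gmm_single_expert` is a hypothetical per-expert entry point, not an existing function.

```python
import torch


def hybrid_gmm(a, b, batch_sizes, backend):
    """a: (sum(batch_sizes), k); b: (num_experts, k, n); batch_sizes: per-expert row counts."""
    outputs = []
    start = 0
    for expert, rows in enumerate(batch_sizes.tolist()):
        a_e = a[start:start + rows]
        try:
            # Fast path: hipBLASLt BF16 GEMM for this expert's slice (hypothetical entry point).
            out_e = backend.gmm_single_expert(a_e, b[expert])
        except RuntimeError:
            # Fallback: stable FP32 matmul, cast back to the input dtype.
            out_e = (a_e.float() @ b[expert].float()).to(a.dtype)
        outputs.append(out_e)
        start += rows
    return torch.cat(outputs, dim=0)
```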
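
Finally, an illustrative tolerance check for `tests/test_gg.py` once hipBLASLt is stable; the bounds mirror the ~1e-3..1e-2 BF16 differences observed above and are not final values.

```python
# Illustrative only: compare a BF16 hipBLASLt output against an FP32
# torch.matmul reference with BF16-scale tolerances.
import torch


def assert_close_bf16(out: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> None:
    ref = a.float() @ b.float()
    torch.testing.assert_close(out.float(), ref, rtol=1e-2, atol=1e-3)
```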