[Mlir-commits] [llvm] [mlir] [OMPIRBuilder] Hoist alloca's to entry blocks of compiler-emitted GPU reduction functions (PR #181359)

Mon Feb 16 04:02:49 PST 2026

ergawy wrote:

> @ergawy, can you elaborate on what exactly causes the runtime memory fault? I see a bit more context in [ccb47d0](https://github.com/llvm/llvm-project/commit/ccb47d0fb9d01d44764fa4ca5c6dcf239ab76ed2), but it's still not clear to me. Is there something mal-formed about the alloca's appearing outside of the entry block, or is it a limitation of the GPU backend that can't handle it? Also, is the issue limited to `-O0` because the alloca's are hoisted by other passes at higher optimization levels?

In the following, I use 4 different compilations of the reproducer in the PR description:
1. `-O0` **without** the fixes in this PR.
2. `-O0` **with** the fixes in this PR.
3. `-O3` **without** the fixes in this PR.
4. `-O3` **with** the fixes in this PR.

In each of the above scenarios, I compiled using `--save-temps`.

A few observations that might help clarify the issue:

1. Inspecting the GPU assembly file (`test.amdgcn.gfx90a.img.lto.s`), one interesting thing to note is that only case 1 above (`-O0` without the fix) uses dynamic stack allocations. You can verify that by searching for `has_dyn_sized_stack`: it is set to `1` for `__omp_offloading_fc00_6b22f64__QQmain_l7_omp$reduction$reduction_func` only in case 1. In this case, the dynamic stack adjustment, I think, happens in an instruction you will find in the function's prologue: `s_add_i32 s32, s32, 0x800`. In all other cases (2, 3, and 4 above), `has_dyn_sized_stack` is not set and there is no similar stack adjustment. This dynamic stack allocation happens only because, without the fix, some `alloca` instructions are emitted in non-entry blocks. It is only visible at `-O0` since, apparently, at higher optimization levels such allocations are hoisted by later passes (I did not narrow down which pass(es) do that).

2. In the Fortran reproducer above, **even without the fixes in the PR**, if you change the number of iterations to be <= 64, the code runs successfully and produces the correct result. If we change the number of iterations to anything > 64, we start to observe the runtime crash. 64 is the number of threads in a single wave/warp. This supports the previous observation that dynamic stack allocation due to non-entry-block `alloca`s is the cause of the crash. If you need more than one wave (i.e. number of iterations > 64), the cross-wave/warp reduction functions have to do work that was not needed for <= 64 iterations. It is in these compiler-emitted functions that we have the non-entry-block allocations.
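
The search described in observation 1 could be scripted roughly as follows, so the four `--save-temps` outputs can be compared quickly. This is only a sketch: the helper names (`find_dyn_stack_lines`, `is_set`) and the sample text are hypothetical, and it assumes the flag's value is the last comma-separated field on a line that mentions `has_dyn_sized_stack`, which may not match the exact layout of your assembly file.

```python
import sys


def find_dyn_stack_lines(asm_text, token="has_dyn_sized_stack"):
    """Return (line_number, stripped_line) for every line mentioning `token`."""
    return [
        (number, line.strip())
        for number, line in enumerate(asm_text.splitlines(), start=1)
        if token in line
    ]


def is_set(line):
    """Guess whether the flag on `line` is 1.

    ASSUMPTION: the value is the last comma-separated field of the line.
    Adjust this if the assembly printer uses a different layout.
    """
    return line.rsplit(",", 1)[-1].strip() == "1"


# Hypothetical sample text, NOT copied from a real assembly file.
SAMPLE = """\
.set reduction_func.has_dyn_sized_stack, 1
.set other_func.has_dyn_sized_stack, 0
s_add_i32 s32, s32, 0x800
"""

if __name__ == "__main__":
    # With a path argument, scan that file; otherwise scan the sample text.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            text = handle.read()
    else:
        text = SAMPLE
    for number, line in find_dyn_stack_lines(text):
        marker = "DYNAMIC" if is_set(line) else "static "
        print(f"{marker} line {number}: {line}")
```

Running it once per compilation (for example on each copy of `test.amdgcn.gfx90a.img.lto.s`) should make it easy to see whether only the `-O0` build without the fix reports a dynamically sized stack.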

I am not a backend expert so I might have missed some part of the picture. But I hope this at least clarifies the issue and justifies the fix.

https://github.com/llvm/llvm-project/pull/181359