[flang-commits] [flang] [flang] Simplify hlfir.sum total reductions. (PR #119482)

Thu Dec 12 16:57:03 PST 2024

vzakhari wrote:

I tried the following test:
```
  subroutine test(x, y)
    real :: x(:,:), y(:)
    !$omp parallel workshare
    y = sum(x,dim=1)
    !$omp end parallel workshare
  end subroutine test
```

When I place the alloca inside the elemental, I get the following:
```
// -----// IR Dump After OptimizedBufferization (opt-bufferization) //----- //
  omp.parallel {
    omp.workshare {
      omp.workshare.loop_wrapper {
        omp.loop_nest (%arg2) : index = (%c1) to (%4#1) inclusive step (%c1) {
          %5 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
          fir.do_loop

// -----// IR Dump After LowerWorkshare (lower-workshare) //----- //
    omp.parallel {
      omp.wsloop nowait {
        omp.loop_nest (%arg2) : index = (%c1) to (%6#1) inclusive step (%c1) {
          %7 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
          fir.do_loop

// -----// IR Dump Before LLVMAddComdats (llvm-add-comdats) //----- //
    omp.parallel {
      %52 = llvm.alloca %51 x f32 {bindc_name = ".sum.reduction", pinned} : (i64) -> !llvm.ptr
      omp.wsloop nowait {
        omp.loop_nest (%arg2) : i64 = (%5) to (%62) inclusive step (%5) {
          llvm.store %4, %52 {tbaa = [#tbaa_tag3]} : f32, !llvm.ptr

; *** IR Dump After Annotation2MetadataPass on [module] ***
define internal void @_QFPtest..omp_par(ptr noalias %tid.addr, ptr noalias %zero.addr, ptr %0) #1 {
omp.par.region1:                                  ; preds = %omp.par.region
  %2 = alloca float, i64 1, align 4
...
define internal void @_QFPtest(ptr %0, ptr %1) #0 {
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr @1, i32 1, ptr @_QFPtest..omp_par, ptr %structArg)
```

So it seems that each thread allocates its own temporary, and the stack is automatically reclaimed on exit from `_QFPtest..omp_par`. This also looks fine to me at the MLIR level.

When I place the alloca outside the elemental:
```
// -----// IR Dump After OptimizedBufferization (opt-bufferization) //----- //
  omp.parallel {
    omp.workshare {
      %3 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
      omp.workshare.loop_wrapper {
        omp.loop_nest (%arg2) : index = (%c1) to (%5#1) inclusive step (%c1) {
          fir.store %cst to %3 : !fir.ref<f32>
          fir.do_loop

// -----// IR Dump After LowerWorkshare (lower-workshare) //----- //
    omp.parallel {
      %5 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
      omp.single copyprivate(%5 -> @_workshare_copy_f32 : !fir.ref<f32>) {
        omp.terminator
      }
      omp.wsloop nowait {
        omp.loop_nest (%arg2) : index = (%c1) to (%7#1) inclusive step (%c1) {
          fir.store %cst to %5 : !fir.ref<f32>
          fir.do_loop

// -----// IR Dump Before LLVMAddComdats (llvm-add-comdats) //----- //
    omp.parallel {
      %52 = llvm.alloca %51 x f32 {bindc_name = ".sum.reduction", pinned} : (i64) -> !llvm.ptr
      omp.single copyprivate(%52 -> @_workshare_copy_f32 : !llvm.ptr) {
        omp.terminator
      }
      omp.wsloop nowait {
        omp.loop_nest (%arg2) : i64 = (%5) to (%63) inclusive step (%5) {
          llvm.store %4, %52 {tbaa = [#tbaa_tag4]} : f32, !llvm.ptr

; *** IR Dump After Annotation2MetadataPass on [module] ***
define internal void @_QFPtest..omp_par(ptr noalias %tid.addr, ptr noalias %zero.addr, ptr %0) #1 {
omp.par.region1:                                  ; preds = %omp.par.region
  %2 = alloca float, i64 1, align 4
  %3 = alloca i32, align 4
  store i32 0, ptr %3, align 4
  %omp_global_thread_num2 = call i32 @__kmpc_global_thread_num(ptr @1)
  %4 = call i32 @__kmpc_single(ptr @1, i32 %omp_global_thread_num2)
  %5 = icmp ne i32 %4, 0
  br i1 %5, label %omp_region.body, label %omp_region.end

omp_region.end:                                   ; preds = %omp.par.region1, %omp.region.cont3
  %omp_global_thread_num4 = call i32 @__kmpc_global_thread_num(ptr @1)
  %6 = load i32, ptr %3, align 4
  call void @__kmpc_copyprivate(ptr @1, i32 %omp_global_thread_num4, i64 0, ptr %2, ptr @_workshare_copy_f32, i32 %6)
...
define internal void @_QFPtest(ptr %0, ptr %1) #0 {
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr @1, i32 1, ptr @_QFPtest..omp_par, ptr %structArg)
```

So, apart from the `single copyprivate`, I think we end up with the same behavior.

Hoisting the alloca seems okay to me, but it may not always be enough to get rid of the stacksave/stackrestore (e.g. if we generate the elemental inside another loop, we will end up with the stack bookkeeping anyway).

I am now inclined to go back to the SSA reduction, though modifying `hlfir::genLoopNest` for this will look awkward (because the related OpenMP constructs do not support SSA reductions). So I will probably add something like `hlfir::genLoopNestWithSSAReduction` that only generates `fir.do_loop`s, and then we can try to merge the two while also resolving this for OpenMP.
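To make the two reduction styles concrete, here is a rough, standalone C++ model of `y = sum(x,dim=1)`. It is only a sketch and not flang code: `Matrix`, `sumDim1InMemory`, `loopCarrying` and `sumDim1Carried` are hypothetical names made up for illustration. The first function mimics the `.sum.reduction` temporary (store the init value, then load/add/store in the inner loop); the second threads the partial sum through the loop as a value, which is roughly what a `fir.do_loop` carrying the reduction as a loop result would express.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major 2-D array, mimicking Fortran's x(i, j) layout.
// (Hypothetical helper type, for illustration only.)
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  float at(std::size_t i, std::size_t j) const { return data[j * rows + i]; }
};

// In-memory reduction: the accumulator is a stack slot accessed through
// a pointer, analogous to the ".sum.reduction" alloca in the dumps.
std::vector<float> sumDim1InMemory(const Matrix &x) {
  std::vector<float> y(x.cols);
  float slot;         // the temporary, hoisted out of the outer loop
  float *acc = &slot;
  for (std::size_t j = 0; j < x.cols; ++j) {
    *acc = 0.0f;      // store of the initial value
    for (std::size_t i = 0; i < x.rows; ++i)
      *acc = *acc + x.at(i, j);  // load, add, store
    y[j] = *acc;
  }
  return y;
}

// Loop helper that carries one value from iteration to iteration and
// returns the final one; the body yields the next carried value.
template <typename Body>
float loopCarrying(std::size_t n, float init, Body body) {
  float carried = init;
  for (std::size_t i = 0; i < n; ++i)
    carried = body(i, carried);
  return carried;
}

// Value-carried ("SSA-style") reduction: no addressable temporary, so
// there is nothing to allocate per outer iteration or per thread.
std::vector<float> sumDim1Carried(const Matrix &x) {
  std::vector<float> y(x.cols);
  for (std::size_t j = 0; j < x.cols; ++j)
    y[j] = loopCarrying(x.rows, 0.0f, [&](std::size_t i, float acc) {
      return acc + x.at(i, j);
    });
  return y;
}
```

Both functions add the elements in the same order, so they return identical results. The difference is only in where the partial sum lives, which is what determines whether any stack bookkeeping is needed around the loop.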

What do you think?

https://github.com/llvm/llvm-project/pull/119482