[flang-commits] [flang] Simplify hlfir.sum total reductions. (PR #119482)
Slava Zakharin via flang-commits
flang-commits at lists.llvm.org
Thu Dec 12 16:57:03 PST 2024
vzakhari wrote:
I tried the following test:
```
subroutine test(x, y)
  real :: x(:,:), y(:)
  !$omp parallel workshare
  y = sum(x, dim=1)
  !$omp end parallel workshare
end subroutine test
```
When I place the alloca inside the elemental, I get the following:
```
// -----// IR Dump After OptimizedBufferization (opt-bufferization) //----- //
omp.parallel {
  omp.workshare {
    omp.workshare.loop_wrapper {
      omp.loop_nest (%arg2) : index = (%c1) to (%4#1) inclusive step (%c1) {
        %5 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
        fir.do_loop
// -----// IR Dump After LowerWorkshare (lower-workshare) //----- //
omp.parallel {
  omp.wsloop nowait {
    omp.loop_nest (%arg2) : index = (%c1) to (%6#1) inclusive step (%c1) {
      %7 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
      fir.do_loop
// -----// IR Dump Before LLVMAddComdats (llvm-add-comdats) //----- //
omp.parallel {
  %52 = llvm.alloca %51 x f32 {bindc_name = ".sum.reduction", pinned} : (i64) -> !llvm.ptr
  omp.wsloop nowait {
    omp.loop_nest (%arg2) : i64 = (%5) to (%62) inclusive step (%5) {
      llvm.store %4, %52 {tbaa = [#tbaa_tag3]} : f32, !llvm.ptr
; *** IR Dump After Annotation2MetadataPass on [module] ***
define internal void @_QFPtest..omp_par(ptr noalias %tid.addr, ptr noalias %zero.addr, ptr %0) #1 {
omp.par.region1: ; preds = %omp.par.region
  %2 = alloca float, i64 1, align 4
...
define internal void @_QFPtest(ptr %0, ptr %1) #0 {
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr @1, i32 1, ptr @_QFPtest..omp_par, ptr %structArg)
```
So it seems that we end up allocating the temporary in each thread, and then the stack is automatically reclaimed after exiting `_QFPtest..omp_par`. It also looks fine to me at the MLIR level.
When I place the alloca outside the elemental:
```
// -----// IR Dump After OptimizedBufferization (opt-bufferization) //----- //
omp.parallel {
  omp.workshare {
    %3 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
    omp.workshare.loop_wrapper {
      omp.loop_nest (%arg2) : index = (%c1) to (%5#1) inclusive step (%c1) {
        fir.store %cst to %3 : !fir.ref<f32>
        fir.do_loop
// -----// IR Dump After LowerWorkshare (lower-workshare) //----- //
omp.parallel {
  %5 = fir.alloca f32 {bindc_name = ".sum.reduction", pinned}
  omp.single copyprivate(%5 -> @_workshare_copy_f32 : !fir.ref<f32>) {
    omp.terminator
  }
  omp.wsloop nowait {
    omp.loop_nest (%arg2) : index = (%c1) to (%7#1) inclusive step (%c1) {
      fir.store %cst to %5 : !fir.ref<f32>
      fir.do_loop
// -----// IR Dump Before LLVMAddComdats (llvm-add-comdats) //----- //
omp.parallel {
  %52 = llvm.alloca %51 x f32 {bindc_name = ".sum.reduction", pinned} : (i64) -> !llvm.ptr
  omp.single copyprivate(%52 -> @_workshare_copy_f32 : !llvm.ptr) {
    omp.terminator
  }
  omp.wsloop nowait {
    omp.loop_nest (%arg2) : i64 = (%5) to (%63) inclusive step (%5) {
      llvm.store %4, %52 {tbaa = [#tbaa_tag4]} : f32, !llvm.ptr
; *** IR Dump After Annotation2MetadataPass on [module] ***
define internal void @_QFPtest..omp_par(ptr noalias %tid.addr, ptr noalias %zero.addr, ptr %0) #1 {
omp.par.region1: ; preds = %omp.par.region
  %2 = alloca float, i64 1, align 4
  %3 = alloca i32, align 4
  store i32 0, ptr %3, align 4
  %omp_global_thread_num2 = call i32 @__kmpc_global_thread_num(ptr @1)
  %4 = call i32 @__kmpc_single(ptr @1, i32 %omp_global_thread_num2)
  %5 = icmp ne i32 %4, 0
  br i1 %5, label %omp_region.body, label %omp_region.end

omp_region.end: ; preds = %omp.par.region1, %omp.region.cont3
  %omp_global_thread_num4 = call i32 @__kmpc_global_thread_num(ptr @1)
  %6 = load i32, ptr %3, align 4
  call void @__kmpc_copyprivate(ptr @1, i32 %omp_global_thread_num4, i64 0, ptr %2, ptr @_workshare_copy_f32, i32 %6)
...
define internal void @_QFPtest(ptr %0, ptr %1) #0 {
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr @1, i32 1, ptr @_QFPtest..omp_par, ptr %structArg)
```
So, besides the `single copyprivate`, we end up with the same behavior, I think.
It seems to me that hoisting the alloca should be okay, but it might not be enough to always get rid of the stacksave/stackrestore (e.g. if we generate the elemental inside another loop, we will end up with the stack bookkeeping anyway).
I am now inclined to go back to the SSA reduction, though modifying `hlfir::genLoopNest` for this would look awkward (since the related OpenMP constructs do not support SSA reductions). So I will probably add something like `hlfir::genLoopNestWithSSAReduction` that only generates `fir.do_loop`s, and then we can try to merge the two while also resolving this for OpenMP.
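As a very rough sketch of what I have in mind (the signature, the `genBody` callback, and the dimension ordering below are placeholders of mine, not the actual patch), such a helper could build a `fir.do_loop` nest that threads the running reduction value through `iter_args` and returns it as an SSA value, with no alloca at all:
```
#include "flang/Optimizer/Builder/FIRBuilder.h"
#include "flang/Optimizer/Dialect/FIROps.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"

// Hypothetical sketch only: builds a perfect nest of fir.do_loop ops that
// carry one reduction value through iter_args and returns the final value.
static mlir::Value genLoopNestWithSSAReduction(
    mlir::Location loc, fir::FirOpBuilder &builder, mlir::ValueRange extents,
    mlir::Value init,
    llvm::function_ref<mlir::Value(fir::FirOpBuilder &, mlir::Location,
                                   mlir::ValueRange /*indices*/,
                                   mlir::Value /*reduction*/)>
        genBody) {
  mlir::Value one =
      builder.createIntegerConstant(loc, builder.getIndexType(), 1);
  llvm::SmallVector<fir::DoLoopOp> loops;
  llvm::SmallVector<mlir::Value> indices;
  mlir::Value carried = init;
  // Create the loops outermost-first (dimension ordering is simplified here),
  // feeding the current reduction value into each loop as an iter_arg.
  for (mlir::Value extent : extents) {
    llvm::SmallVector<mlir::Value> iterArgs{carried};
    auto loop = builder.create<fir::DoLoopOp>(loc, one, extent, one,
                                              /*unordered=*/false,
                                              /*finalCountValue=*/false,
                                              iterArgs);
    loops.push_back(loop);
    indices.push_back(loop.getInductionVar());
    carried = loop.getRegionIterArgs()[0];
    builder.setInsertionPointToStart(loop.getBody());
  }
  // The caller emits the innermost body and returns the updated reduction.
  mlir::Value updated = genBody(builder, loc, indices, carried);
  // Yield the updated value from each loop via fir.result, innermost-first;
  // each loop's result becomes the value yielded by its parent.
  for (fir::DoLoopOp loop : llvm::reverse(loops)) {
    builder.setInsertionPointToEnd(loop.getBody());
    builder.create<fir::ResultOp>(loc, updated);
    updated = loop.getResult(0);
    builder.setInsertionPointAfter(loop);
  }
  return updated; // Final reduction value in SSA form.
}
```
That keeps the sequential lowering purely in SSA form, while the OpenMP path can stay on the alloca-based `hlfir::genLoopNest` until the related constructs can model SSA reductions.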
What do you think?
https://github.com/llvm/llvm-project/pull/119482