[Openmp-commits] [clang] [llvm] [mlir] [openmp] [OpenMP][offload] Cross-team reductions with variable number of teams (PR #195102)

Sat May 2 00:33:16 PDT 2026

ro-i wrote:

Sure, I can split the PR and provide deeper performance insights in the next few days.
Every claim I made so far can be verified using the tests in https://github.com/ro-i/xteam-test.

Note that I have multiple reductions per translation unit there, which also affects performance for implementations without inlining, due to the dispatch switch for the indirect function calls (`shflFct`, `cpyFct`, etc).
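To illustrate why indirect calls matter here, the following is a minimal sketch and not the runtime's actual interface. `ReduceFn`, `reduceIndirect`, `reduceDirect`, and `addInt` are hypothetical names chosen for illustration; they only stand in for callbacks in the style of `shflFct` and `cpyFct`. The first helper calls through a function pointer on every element, while the templated one lets the compiler see and inline the operation.

```cpp
#include <cstddef>

// Hypothetical callback type: combines one value into an accumulator.
using ReduceFn = void (*)(void *acc, const void *val);

// Generic helper in the style of a runtime entry point. Every element costs
// one call through `fn`, which stays an indirect call unless this helper is
// inlined into a caller where the pointer value is known.
void reduceIndirect(void *acc, const void *vals, std::size_t n,
                    std::size_t elemSize, ReduceFn fn) {
  const char *p = static_cast<const char *>(vals);
  for (std::size_t i = 0; i < n; ++i)
    fn(acc, p + i * elemSize);
}

// Templated variant: the operation is part of the type, so the compiler can
// inline it and no dispatch over possible callees is needed.
template <typename T, typename Op>
T reduceDirect(const T *vals, std::size_t n, T init, Op op) {
  T acc = init;
  for (std::size_t i = 0; i < n; ++i)
    acc = op(acc, vals[i]);
  return acc;
}

// One concrete callback for the indirect variant (integer sum).
void addInt(void *acc, const void *val) {
  *static_cast<int *>(acc) += *static_cast<const int *>(val);
}
```

With several reductions in one translation unit, each one contributes its own callbacks, so a non-inlined generic helper has more possible callees to choose between at each indirect call site.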
Without considering any inlining, the performance benefits of this patch are especially visible for 208 teams. For 10400 teams, I even observed slight regressions*, which are mitigated by using inlining. I think that for higher team counts, the implementation in this patch suffers more from the poor generated code than the previous implementation does, because the last team has a tighter loop at the end for collecting and reducing the per-team values from the other teams from global memory.
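For readers unfamiliar with the general pattern, here is a host-side sketch of a "last team finishes the reduction" scheme. It is a simplified model under stated assumptions, not the code in this patch: `crossTeamSum` is a hypothetical name, the `partials` vector stands in for the global-memory buffer, and the teams run one after another here, whereas on a GPU they run concurrently (which is why the counter is atomic).

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical model of a cross-team sum reduction.
double crossTeamSum(const std::vector<double> &data, std::size_t numTeams) {
  std::vector<double> partials(numTeams, 0.0); // stand-in for global memory
  std::atomic<std::size_t> teamsDone{0};       // counts finished teams
  double result = 0.0;

  for (std::size_t team = 0; team < numTeams; ++team) {
    // Step 1: team-local reduction over a strided slice of the input.
    double local = 0.0;
    for (std::size_t i = team; i < data.size(); i += numTeams)
      local += data[i];

    // Step 2: publish the per-team value.
    partials[team] = local;

    // Step 3: the team that observes the final count is the last one and
    // collects all per-team values in a single loop.
    if (teamsDone.fetch_add(1, std::memory_order_acq_rel) + 1 == numTeams) {
      double sum = 0.0;
      for (std::size_t t = 0; t < numTeams; ++t)
        sum += partials[t];
      result = sum;
    }
  }
  return result;
}
```

The loop in step 3 grows with the number of teams, so its code quality matters more at 10400 teams than at 208.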

*Note that the relevance of a regression from xx MB/s to yy MB/s is limited in some cases, considering that we need to get to 1.z TB/s, for which codegen is the dominating factor, not runtime implementation details.

PS: The snippets from LIBOMPTARGET_INFO I provided in my initial post were taken *after* rebasing my patch. Before the rebase (with the SHAs specified in the README in https://github.com/ro-i/xteam-test, commit 6854b7abc8848702b5a2d9ce2ea02849b5dc590b), the picture was exactly the same, but with a slightly different register count.

https://github.com/llvm/llvm-project/pull/195102