[PATCH] D147408: [AMDGPU] Enable AMDGPU Atomic Optimizer Pass by default.

Tue Apr 4 07:22:46 PDT 2023

b-sumner added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:463
+
+  for (unsigned LaneIdx = 0; LaneIdx < WaveFrontSize; LaneIdx++) {
+    // Iterate over all the lanes of a wavefront to compute the partial sum. If
----------------
pravinjagtap wrote:
> ruiling wrote:
> > Why do we choose to unroll the loop over wave-front-size? I think this makes the sp3 assembly hard to read. Shouldn't a loop over active lanes just work?
> Hello @ruiling,
> 
> One of the considerations for selecting this approach is its simplicity and efforts required for implementation. We know that the most optimized implementation for the scan is DPP with WWM. In the future, this iterative approach will become redundant when concerns related to WWM robustness are addressed. If you and everyone else think that loop over active lanes is the right thing to do, I will start implementing it.
Scalar branches may be the most expensive aspect of this algorithm .  If the loop is fully unrolled, we still end up with 64 in wave64.  If we didn't unroll the active-lane approach loop, then if the wave is full, then couldn't we end up with 126 scalar branches?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147408/new/

https://reviews.llvm.org/D147408