[PATCH] D147408: [AMDGPU] Enable AMDGPU Atomic Optimizer Pass by default.

Tue Apr 4 08:25:00 PDT 2023

b-sumner added a comment.

In D147408#4243403 <https://reviews.llvm.org/D147408#4243403>, @foad wrote:

>> Scalar branches may be the most expensive aspect of this algorithm
>
> If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.
>
>     // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
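>     s_mov s0, exec // mask of lanes still to be processed
>     s_mov s3, 0 // accumulator
>   // repeat this section 32 or 64 times:
>     s_ff1 s1, s0 // find lowest remaining active lane
>     s_cmp_eq s1, -1 // -1 means no active lanes remain
>     s_cbranch_scc1 end
>     s_bitset0 s0, s1 // mark this lane as handled
>     v_readlane s2, v0, s1 // fetch this lane's input value
>     s_add s3, s2 // update the running sum
>     v_writelane v1, s3, s1 // store the running sum into this lane of v1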
>   // end of repeated section
>   end:

Yes, that looks like what we want. The challenge will be creating IR that lowers to that sequence.
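
To make the intended semantics concrete, here is a rough Python model of the quoted sequence. It is only a sketch for checking the expected results, not IR and not what the pass emits. The function name, the list-based lane model, and the `wave_size` parameter are made up for illustration, and 32-bit wraparound of the accumulator is not modeled.

```python
def inclusive_scan_active_lanes(v0, exec_mask, wave_size=64):
    """Model of the unrolled scalar loop from the quoted pseudo-assembly.

    v0        -- list of per-lane input values (length >= wave_size)
    exec_mask -- integer bitmask of active lanes (bit i set = lane i active)

    Returns (v1, s3): v1 holds the inclusive plus-scan in each active lane
    (inactive lanes are left as None), s3 holds the plus-reduction.
    """
    v1 = [None] * wave_size
    s0 = exec_mask & ((1 << wave_size) - 1)  # s_mov s0, exec
    s3 = 0                                   # s_mov s3, 0 (accumulator)
    for _ in range(wave_size):               # repeated 32 or 64 times
        if s0 == 0:
            # s_ff1 would return -1 here, so s_cbranch_scc1 jumps to "end".
            break
        s1 = (s0 & -s0).bit_length() - 1     # s_ff1: index of lowest set bit
        s0 &= ~(1 << s1)                     # s_bitset0 s0, s1
        s2 = v0[s1]                          # v_readlane s2, v0, s1
        s3 += s2                             # s_add s3, s2
        v1[s1] = s3                          # v_writelane v1, s3, s1
    return v1, s3
```

For example, with a toy 8-lane wave, v0 = [1, 2, 3, 4, 5, 6, 7, 8] and exec = 0b10110101, lanes 0, 2, 4, 5 and 7 are active; they receive 1, 4, 9, 15 and 23, and the reduction in s3 is 23. The early exit is taken once, after the last active lane, which matches the single taken branch in the quoted sequence.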

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D147408