[PATCH] D147408: [AMDGPU] Enable AMDGPU Atomic Optimizer Pass by default.

Wed Apr 5 23:52:51 PDT 2023

ruiling added a comment.

> If not-taken conditional branches are cheap then we could do something like this. It only has one taken branch, when we have finished handling all the active lanes.
>
>     // Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
>     s_mov s0, exec
>     s_mov s3, 0 // accumulator
>   // repeat this section 32 or 64 times:
>     s_ff1 s1, s0 // find lowest remaining active lane
>     s_cmp_eq s1, -1
>     s_cbranch_scc1 end
>     s_bitset0 s0, s1
>     v_readlane s2, v0, s1
>     s_add s3, s2
>     v_writelane v1, s3, s1
>   // end of repeated section
>   end:

Here is LLVM IR that can do this:

  bb0:
    %value = ... 
    ; Mask of active lanes (wave32); the current lane is active, so it is non-zero.
    %ballot = call i32 @llvm.amdgcn.ballot.i32(i1 1)
    br label %bb1

  bb1:
    %accum = phi i32 [ 0, %bb0 ], [ %new_accum, %bb1 ]
    %old_value_phi = phi i32 [ poison, %bb0 ], [ %old_value, %bb1 ]
    %active_bits = phi i32 [ %ballot, %bb0 ], [ %new_active_bits, %bb1 ]
    ; Find the lowest remaining active lane.
    %ff1 = call i32 @llvm.cttz.i32(i32 %active_bits, i1 true)

    %lane_value = call i32 @llvm.amdgcn.readlane(i32 %value, i32 %ff1)
    ; The accumulator is written before this lane's value is added,
    ; so each lane receives the sum of the lower active lanes.
    %old_value = call i32 @llvm.amdgcn.writelane(i32 %accum, i32 %ff1, i32 %old_value_phi)
    %new_accum = add i32 %accum, %lane_value

    ; Clear the handled lane and loop until no active lanes remain.
    %mask = shl i32 1, %ff1
    %inverse_mask = xor i32 %mask, -1
    %new_active_bits = and i32 %active_bits, %inverse_mask
    %is_end = icmp eq i32 %new_active_bits, 0
    br i1 %is_end, label %bb2, label %bb1

  bb2:
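
To make the loop easier to check by hand, here is a rough host-side C++ model of it. This is only a sketch under assumptions: `ScanResult`, `countTrailingZeros` and `exclusiveScanActiveLanes` are hypothetical names, not LLVM APIs, a wave is modelled as a 32-entry array, and inactive lanes are left at 0 where the IR leaves them as poison. Like the IR, it stores the accumulator before adding the lane's value, so each lane gets an exclusive prefix sum and the final accumulator holds the reduction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of the result: a per-lane scan plus the reduction.
struct ScanResult {
  std::array<uint32_t, 32> scan{}; // exclusive prefix sum; inactive lanes stay 0
  uint32_t total = 0;              // sum over all active lanes
};

// Index of the lowest set bit. Precondition: x != 0 (mirrors cttz with
// is_zero_poison = true in the IR).
static unsigned countTrailingZeros(uint32_t x) {
  unsigned n = 0;
  while ((x & 1u) == 0) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Models the bb1 loop: visit active lanes from lowest to highest.
ScanResult exclusiveScanActiveLanes(uint32_t exec,
                                    const std::array<uint32_t, 32> &value) {
  assert(exec != 0 && "at least one lane must be active");
  ScanResult r;
  uint32_t active = exec;
  uint32_t accum = 0;
  do {
    unsigned lane = countTrailingZeros(active); // %ff1
    uint32_t laneValue = value[lane];           // readlane
    r.scan[lane] = accum;                       // writelane (before the add)
    accum += laneValue;                         // %new_accum
    active &= ~(uint32_t{1} << lane);           // %new_active_bits
  } while (active != 0);
  r.total = accum;
  return r;
}
```

For example, with lanes 0, 1 and 3 active and holding 5, 7 and 2, the model gives those lanes 0, 5 and 12 and a total of 14. Swapping the store and the add would give the inclusive scan described in the quoted assembly.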

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D147408