[PATCH] D146523: [AMDGPU]: Add new intrinsic llvm.amdgcn.convergent.copy

Wed Mar 22 17:37:37 PDT 2023

ruiling added a comment.

>   VAL = ... // VGPR
>   RES = ... // FINAL result of scan, active lanes will write to this VGPR
>   sum = 0;                               // SGPR, holds the partial sum
>   for (int lane = 0; lane < 64; lane++) {
>       if(IsActive(lane)) {                      // check to see whether lane is active or not 
>           elementToadd = readlane(VAL, lane );  // SGPR, read value which we want to add from VAL at lane id
>           sum = sum + elementToadd;            // SGPR, update the value
>           writelane(RES, sum, lane )          // write value sum(SGPR) to VGPR RES at lane
>       } 
>   }

This is a dangerous way to program our GPU. Please see the comment below for why we should not do this.
A possible safe way is to do something like:

  // All the active threads should enter the loop.
  bool hasUnprocessedLane = true;
  do {
    firstLane = get_first_active_lane();
    if (lane_id == firstLane) {
      // Only the first active lane gets here; other threads skip to the loop end.
      do the work for this active lane
      hasUnprocessedLane = false;
    }
  } while (hasUnprocessedLane);

`hasUnprocessedLane` makes the lane processed in this iteration (the first active lane) exit the loop, while all the other lanes keep iterating.
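
To make the loop above concrete, here is a minimal host-side sketch in C++ that models a single 64-lane wavefront with a bit mask. It is not GPU code and not what the patch generates; `waterfallScan`, `firstActiveLane`, and the `exec` mask parameter are hypothetical names introduced only for this illustration, and "the work" is assumed to be the running sum from the quoted example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWaveSize = 64;

// Hypothetical helper: index of the lowest set bit in mask, or -1 if none.
int firstActiveLane(std::uint64_t mask) {
    for (int lane = 0; lane < kWaveSize; ++lane) {
        if (mask & (std::uint64_t{1} << lane)) return lane;
    }
    return -1;
}

// Host-side model of one wavefront. `exec` has one bit per active lane;
// `val` and the returned array stand in for per-lane VGPRs.
std::array<int, kWaveSize> waterfallScan(const std::array<int, kWaveSize>& val,
                                         std::uint64_t exec) {
    std::array<int, kWaveSize> res{};  // inactive lanes stay 0 in this model
    int sum = 0;                       // uniform partial sum (the "SGPR")
    std::uint64_t inLoop = exec;       // every active lane enters the loop
    while (inLoop != 0) {
        int lane = firstActiveLane(inLoop);
        assert(lane >= 0);
        // Only the first active lane does the work in this iteration.
        sum += val[lane];
        res[lane] = sum;
        // That lane clears its hasUnprocessedLane flag and leaves the loop.
        inLoop &= ~(std::uint64_t{1} << lane);
    }
    return res;
}
```

For example, with `val[i] = i + 1` and lanes 1, 3 and 4 active (`exec = 0x1A`), this model writes 2, 6 and 11 to those lanes. The point is that every active lane enters the loop, and exactly one lane leaves per iteration.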

> define protected amdgpu_kernel void @_Z3SumPiS_S_(...) local_unnamed_addr #0 {
> entry:
>
>   %Alloca = alloca i32, align 4, addrspace(5)
>   %ExclScan = load i32, ptr addrspace(5) %Alloca, align 4
>   ....
>   ....
>   %sub.i = sub nsw i32 0, %11
>   %12 = call i64 @llvm.amdgcn.ballot.i64(i1 true)
>   %13 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
>   %14 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %13)
>   %15 = call i32 @llvm.amdgcn.readfirstlane(i32 %14)
>   %16 = icmp eq i32 %14, %15
>   br i1 %16, label %LoopHeader, label %19

The branch condition here is broken: it says that only the first active lane will branch to LoopHeader. That is why the sunk instruction was executed only once. In fact, all the active threads should branch to LoopHeader.
The rule is that the branch condition must correctly reflect which threads take the true successor; we should not break this rule.
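
As a rough illustration of why `%16` is true for only one lane, the following C++ sketch models the quoted IR on the host. It assumes that `mbcnt.lo`/`mbcnt.hi` with an all-ones mask yields each lane's index within the wave and that `readfirstlane` returns the value held by the lowest active lane; `lanesTakingTrueSuccessor` and `exec` are hypothetical names used only here.

```cpp
#include <cstdint>

constexpr int kLanes = 64;

// Returns a mask of the lanes for which the modeled condition %16 is true,
// i.e. the lanes that would branch to LoopHeader.
std::uint64_t lanesTakingTrueSuccessor(std::uint64_t exec) {
    // %15: readfirstlane(%14), modeled as the index of the lowest active lane.
    int firstLaneId = -1;
    for (int lane = 0; lane < kLanes; ++lane) {
        if (exec & (std::uint64_t{1} << lane)) {
            firstLaneId = lane;
            break;
        }
    }

    std::uint64_t taken = 0;
    for (int lane = 0; lane < kLanes; ++lane) {
        const std::uint64_t bit = std::uint64_t{1} << lane;
        const bool active = (exec & bit) != 0;
        // %14: mbcnt with an all-ones mask counts the lanes below this one,
        // which is this lane's index (an assumption of this model).
        const int laneId = lane;
        // %16: icmp eq %14, %15
        if (active && laneId == firstLaneId) taken |= bit;
    }
    return taken;
}
```

With lanes 1, 3 and 4 active (`exec = 0x1A`), the result is `0x2`: only lane 1 takes the true successor, although all three active lanes need to reach LoopHeader.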
[...]

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146523/new/

https://reviews.llvm.org/D146523