[PATCH] D146523: [AMDGPU]: Add new intrinsic llvm.amdgcn.convergent.copy

Tue Mar 21 12:49:32 PDT 2023

foad requested changes to this revision.
foad added a comment.
This revision now requires changes to proceed.
Herald added a subscriber: StephenFan.

I don't think this patch solves any real problem; it just raises a bunch of questions about what you're trying to do.

If you want to read values from inactive lanes of a VGPR robustly then you need something like WWM (whole wave mode), but I guess you don't trust the WWM implementation, so you're back to square one.

However... why are you doing readlane //inside// the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)

  s_mov s0, 0 ; initialize accumulator
  ; conditionally add in lane 0
  v_readlane s1, v0, 0
  s_bitcmp1 exec, 0
  s_cselect s1, s1, 0
  s_add s0, s0, s1
  ; conditionally add in lane 1
  v_readlane s1, v0, 1
  s_bitcmp1 exec, 1
  s_cselect s1, s1, 0
  s_add s0, s0, s1
  ...
  ; conditionally add in lane 31
  v_readlane s1, v0, 31
  s_bitcmp1 exec, 31
  s_cselect s1, s1, 0
  s_add s0, s0, s1
  ; result is in s0

You should be able to generate code like this from regular IR using the readlane intrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.
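
For readers who want to check the logic of the unrolled reduction above, here is a minimal host-side sketch in Python. It is only a model, not compiler output: the names `WAVE_SIZE`, `readlane` and `reduce_active_lanes` are hypothetical, a wave size of 32 is assumed (matching lanes 0..31 in the listing), and the VGPR is represented as a plain list with one value per lane.

```python
WAVE_SIZE = 32  # assumption: wave32, matching lanes 0..31 in the listing


def readlane(vgpr, lane):
    """Model of v_readlane: return the value stored for `lane`.

    The read does not depend on EXEC, so it works for inactive lanes too.
    """
    return vgpr[lane]


def reduce_active_lanes(vgpr, exec_mask):
    """Sum the values of all lanes whose bit is set in `exec_mask`."""
    acc = 0                                   # s_mov s0, 0
    for lane in range(WAVE_SIZE):
        value = readlane(vgpr, lane)          # v_readlane s1, v0, lane
        active = (exec_mask >> lane) & 1      # s_bitcmp1 exec, lane
        value = value if active else 0        # s_cselect s1, s1, 0
        acc = (acc + value) & 0xFFFFFFFF      # s_add s0, s0, s1 (32-bit wrap)
    return acc
```

Calling `reduce_active_lanes(list(range(32)), 0b1010)` adds only lanes 1 and 3, because values from lanes whose EXEC bit is clear are replaced by 0 before the add. EXEC itself is never modified.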

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146523/new/

https://reviews.llvm.org/D146523