[llvm] [AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (PR #107108)

Tue Sep 3 06:36:53 PDT 2024

================
@@ -421,9 +421,10 @@ Value *AMDGPUAtomicOptimizerImpl::buildReduction(IRBuilder<> &B,
 
   // Reduce within each pair of rows (i.e. 32 lanes).
   assert(ST->hasPermLaneX16());
-  Value *Permlanex16Call = B.CreateIntrinsic(
-      V->getType(), Intrinsic::amdgcn_permlanex16,
-      {V, V, B.getInt32(-1), B.getInt32(-1), B.getFalse(), B.getFalse()});
+  Value *Permlanex16Call =
+      B.CreateIntrinsic(AtomicTy, Intrinsic::amdgcn_permlanex16,
+                        {PoisonValue::get(AtomicTy), V, B.getInt32(0),
+                         B.getInt32(0), B.getFalse(), B.getFalse()});
----------------
jayfoad wrote:

Using 0 instead of -1 for the lane select is just a stylistic choice: at this point in a reduction all lanes within a row hold the same value, so it doesn't matter which one we choose. I just think 0 is the least surprising value to use.
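
To illustrate why the select value is immaterial here, the following is a minimal sketch of a simplified software model of the cross-row lane select, not the actual hardware or LLVM implementation. The names `Lanes`, `permlanex16` and `rowUniform` are hypothetical helpers introduced for this example only. The model assumes each lane reads from the opposite 16-lane row, at the index given by its 4-bit field in the two 32-bit select operands, and it ignores EXEC masking and the two boolean operands.

```cpp
#include <array>
#include <cstdint>

// Simplified model of one 32-lane group: two rows of 16 lanes each.
using Lanes = std::array<uint32_t, 32>;

// Hypothetical model of the permlanex16 lane select (an assumption for
// illustration): lane L reads from the *other* row, at the lane index
// stored in the 4-bit field number (L % 16) of the 64-bit value {Sel1, Sel0}.
Lanes permlanex16(const Lanes &Src, uint32_t Sel0, uint32_t Sel1) {
  const uint64_t Sel = (uint64_t(Sel1) << 32) | Sel0;
  Lanes Result{};
  for (unsigned Lane = 0; Lane < 32; ++Lane) {
    unsigned Row = Lane / 16;
    unsigned LaneInRow = Lane % 16;
    unsigned Pick = (Sel >> (4 * LaneInRow)) & 0xF;
    Result[Lane] = Src[(1 - Row) * 16 + Pick];
  }
  return Result;
}

// State after the per-row reduction step: every lane in a row already
// holds that row's reduced value.
Lanes rowUniform(uint32_t Row0, uint32_t Row1) {
  Lanes V{};
  for (unsigned Lane = 0; Lane < 32; ++Lane)
    V[Lane] = Lane < 16 ? Row0 : Row1;
  return V;
}
```

Under this model, `permlanex16(V, 0, 0)` and `permlanex16(V, 0xFFFFFFFF, 0xFFFFFFFF)` return identical results for any `V` built by `rowUniform`, because every candidate source lane in the opposite row holds the same value.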

https://github.com/llvm/llvm-project/pull/107108