[PATCH] D154858: [WIP] [AMDGPU] Add llvm.amdgcn.reduce.add Intrinsic.

Mon Jul 10 09:19:43 PDT 2023

arsenm added a comment.

I was thinking an IR expansion would be easier, but it's good to have a machine one (at least for umin)

================
Comment at: llvm/include/llvm/IR/IntrinsicsAMDGPU.td:1929
+def int_amdgcn_reduce_add :
+  Intrinsic<[llvm_i32_ty], [llvm_i32_ty],
+            [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
----------------
Should have a mangled type. I also think it should have an immarg operand for the preferred lowering strategy to use. Also, wave_reduce?

Also umin would be a better choice for a first one, given that we want it for dynamic alloca handling

================
Comment at: llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp:1
+
+#include "AMDGPU.h"
----------------
Missing header

================
Comment at: llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp:87
+  bool IsWave32 = ST->isWave32();
+  const TargetRegisterClass *RegClass =
+      IsWave32 ? &AMDGPU::SReg_32RegClass : &AMDGPU::SReg_64RegClass;
----------------
There's supposed to be a getWaveRegClass to go through

================
Comment at: llvm/lib/Target/AMDGPU/SILowerReduceAndScanPseudo.cpp:172
+
+bool SIExpandReduceAndScanPseudo::runOnMachineFunction(MachineFunction &MF) {
+
----------------
This doesn't need to be a separate pass, can be a post isel hook 

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll:2
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
+; RUN: llc -march=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize32,-wavefrontsize64 < %s | FileCheck %s
+
----------------
test with  -global-isel=1/0

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll:19
+entry:
+    %result = call i32 @llvm.amdgcn.reduce.add(i32 %in)
+    store i32 %result, ptr addrspace(1) %out
----------------
Also add poison and constant tests 

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll:25
+
+define amdgpu_kernel void @divergent_value(ptr addrspace(1) %out, ptr addrspace(1) %val) #0 {
+; CHECK-LABEL: divergent_value:
----------------
Also add a test where this is under divergent control flow 

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll:52
+  %out.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %divergent_value.kernarg.segment, i64 36
+  %loaded.out.kernarg.offset = load <2 x i64>, ptr addrspace(4) %out.kernarg.offset, align 4
+  %out.load1 = extractelement <2 x i64> %loaded.out.kernarg.offset, i32 0
----------------
Should strip out most of this test

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgpu.reduce.ll:73
+
+attributes #0 = {"target-cpu"="gfx1100"}
+attributes #1 = { nocallback nofree nosync nounwind speculatable willreturn memory(none)}
----------------
Drop this, it's redundant with the run line target and breaks adding multiple run targets

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154858/new/

https://reviews.llvm.org/D154858