[PATCH] D158468: [AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11

Wed Aug 23 14:43:27 PDT 2023

arsenm accepted this revision.
arsenm added inline comments.
This revision is now accepted and ready to land.

================
Comment at: llvm/lib/Target/AMDGPU/VOP3PInstructions.td:437-452
 defm V_DOT4_I32_IU8 : VOP3PDOTIUInst<"v_dot4_i32_iu8", int_amdgcn_sudot4>;
 defm V_DOT8_I32_IU4 : VOP3PDOTIUInst<"v_dot8_i32_iu4", int_amdgcn_sudot8>;
+
+def : GCNPat < (int_amdgcn_sdot8 i32:$src0,
+                                 i32:$src1,
+                                 i32:$src2, (i1 timm:$clamp)),
+               (V_DOT8_I32_IU4  (i32 8), i32:$src0,
----------------
jrbyrnes wrote:
> jrbyrnes wrote:
> > arsenm wrote:
> > > jrbyrnes wrote:
> > > > arsenm wrote:
> > > > > jrbyrnes wrote:
> > > > > > arsenm wrote:
> > > > > > > I don't understand how these cases are different, the intrinsic name is just slightly different from the instruction name?
> > > > > > On all other targets with 8bit and 4bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx100 has int_amdgcn_sUdot4 / int_amdgcn_sUdot8. The result is that users of these intrinsics must always check the target to use the corresponding one (sudot4 for gfx1100, and sdot4 for all others). 
> > > > > > 
> > > > > > This removes that responsibility from the user, so they are able to use sdot4 across all targets and generate the corresponding instructions.
> > > > > > 
> > > > > Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change
> > > > Apologies, It is my mistake potentially causing confusion.
> > > > 
> > > > The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example), is that V_DOT4_I32_IU8 can be either signed or unsigned depending on NEG bit in operand modifier. This target specific feature is probably why there is special handling. 
> > > > 
> > > > 
> > > > 
> > > > See llvm.amdgcn.sudot4 for unit tests.
> > > I mean tests that actually execute, not lit tests
> > So I've tracked down some unit tests.
> > 
> > https://github.com/ROCm-Developer-Tools/HIP/blob/b8965f1f3d58d7adf7d702c09e75ebf3dd718f8c/tests/src/deviceLib/hipTestDotFunctions.cpp#L34
> > 
> > These calls are implemented as calls to __ockl_sdot4:
> > 
> > https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/hipamd/include/hip/amd_detail/amd_math_functions.h#L148C60-L148C60
> > 
> > Which is, in turn, implemented as calls to target specific builtins:
> > 
> > https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/46939af92ad91238c878a82aad2220822073ffa1/ockl/src/dots.cl#L124
> > 
> > For gfx1100, this lowers to __builtin_amdgcn_sudot4 builtin. 
> > 
> > 
> > If you want, I can hack a compiler to lower the __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests.
> > 
> Probably worth mentioning is that I have been validating correctness using CK 8 bit and 16 bit test suite, which -- due to https://reviews.llvm.org/D155995 -- has many existing tests that lower into int_amdgcn_sdot4 for gfx1100.
Ugh, this test is bad. It barely tests it compiles. Really these should test all the edge cases

================
Comment at: llvm/lib/Target/AMDGPU/VOP3PInstructions.td:437-452
 defm V_DOT4_I32_IU8 : VOP3PDOTIUInst<"v_dot4_i32_iu8", int_amdgcn_sudot4>;
 defm V_DOT8_I32_IU4 : VOP3PDOTIUInst<"v_dot8_i32_iu4", int_amdgcn_sudot8>;
+
+def : GCNPat < (int_amdgcn_sdot8 i32:$src0,
+                                 i32:$src1,
+                                 i32:$src2, (i1 timm:$clamp)),
+               (V_DOT8_I32_IU4  (i32 8), i32:$src0,
----------------
arsenm wrote:
> jrbyrnes wrote:
> > jrbyrnes wrote:
> > > arsenm wrote:
> > > > jrbyrnes wrote:
> > > > > arsenm wrote:
> > > > > > jrbyrnes wrote:
> > > > > > > arsenm wrote:
> > > > > > > > I don't understand how these cases are different, the intrinsic name is just slightly different from the instruction name?
> > > > > > > On all other targets with 8bit and 4bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx100 has int_amdgcn_sUdot4 / int_amdgcn_sUdot8. The result is that users of these intrinsics must always check the target to use the corresponding one (sudot4 for gfx1100, and sdot4 for all others). 
> > > > > > > 
> > > > > > > This removes that responsibility from the user, so they are able to use sdot4 across all targets and generate the corresponding instructions.
> > > > > > > 
> > > > > > Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change
> > > > > Apologies, It is my mistake potentially causing confusion.
> > > > > 
> > > > > The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example), is that V_DOT4_I32_IU8 can be either signed or unsigned depending on NEG bit in operand modifier. This target specific feature is probably why there is special handling. 
> > > > > 
> > > > > 
> > > > > 
> > > > > See llvm.amdgcn.sudot4 for unit tests.
> > > > I mean tests that actually execute, not lit tests
> > > So I've tracked down some unit tests.
> > > 
> > > https://github.com/ROCm-Developer-Tools/HIP/blob/b8965f1f3d58d7adf7d702c09e75ebf3dd718f8c/tests/src/deviceLib/hipTestDotFunctions.cpp#L34
> > > 
> > > These calls are implemented as calls to __ockl_sdot4:
> > > 
> > > https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/hipamd/include/hip/amd_detail/amd_math_functions.h#L148C60-L148C60
> > > 
> > > Which is, in turn, implemented as calls to target specific builtins:
> > > 
> > > https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/46939af92ad91238c878a82aad2220822073ffa1/ockl/src/dots.cl#L124
> > > 
> > > For gfx1100, this lowers to __builtin_amdgcn_sudot4 builtin. 
> > > 
> > > 
> > > If you want, I can hack a compiler to lower the __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests.
> > > 
> > Probably worth mentioning is that I have been validating correctness using CK 8 bit and 16 bit test suite, which -- due to https://reviews.llvm.org/D155995 -- has many existing tests that lower into int_amdgcn_sdot4 for gfx1100.
> Ugh, this test is bad. It barely tests it compiles. Really these should test all the edge cases
So apparently we have overlapping intrinsics. We should probably canonicalize llvm.amdgcn.sudot4 cases representable with sdot/udot in AMDGPUInstCombineIntrinsic

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158468/new/

https://reviews.llvm.org/D158468