[PATCH] D158468: [AMDGPU] Support sdot4 / sdot8 intrinsics on gfx11

Wed Aug 23 09:48:39 PDT 2023

jrbyrnes added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/VOP3PInstructions.td:437-452
 defm V_DOT4_I32_IU8 : VOP3PDOTIUInst<"v_dot4_i32_iu8", int_amdgcn_sudot4>;
 defm V_DOT8_I32_IU4 : VOP3PDOTIUInst<"v_dot8_i32_iu4", int_amdgcn_sudot8>;
+
+def : GCNPat < (int_amdgcn_sdot8 i32:$src0,
+                                 i32:$src1,
+                                 i32:$src2, (i1 timm:$clamp)),
+               (V_DOT8_I32_IU4  (i32 8), i32:$src0,
----------------
jrbyrnes wrote:
> arsenm wrote:
> > jrbyrnes wrote:
> > > arsenm wrote:
> > > > jrbyrnes wrote:
> > > > > arsenm wrote:
> > > > > > I don't understand how these cases are different, the intrinsic name is just slightly different from the instruction name?
> > > > > On all other targets with 8bit and 4bit signed dot, we codegen for int_amdgcn_sdot4 and int_amdgcn_sdot8. However, we don't support these on gfx1100 -- instead, gfx100 has int_amdgcn_sUdot4 / int_amdgcn_sUdot8. The result is that users of these intrinsics must always check the target to use the corresponding one (sudot4 for gfx1100, and sdot4 for all others). 
> > > > > 
> > > > > This removes that responsibility from the user, so they are able to use sdot4 across all targets and generate the corresponding instructions.
> > > > > 
> > > > Are there unit tests for these somewhere? I don't really know the full history of these instructions and I'm worried there was some random edge case behavior change
> > > Apologies, It is my mistake potentially causing confusion.
> > > 
> > > The main difference between V_DOT4_I32_IU8 on gfx1100 and V_DOT4_I32_I8 on gfx90a (for example), is that V_DOT4_I32_IU8 can be either signed or unsigned depending on NEG bit in operand modifier. This target specific feature is probably why there is special handling. 
> > > 
> > > 
> > > 
> > > See llvm.amdgcn.sudot4 for unit tests.
> > I mean tests that actually execute, not lit tests
> So I've tracked down some unit tests.
> 
> https://github.com/ROCm-Developer-Tools/HIP/blob/b8965f1f3d58d7adf7d702c09e75ebf3dd718f8c/tests/src/deviceLib/hipTestDotFunctions.cpp#L34
> 
> These calls are implemented as calls to __ockl_sdot4:
> 
> https://github.com/ROCm-Developer-Tools/clr/blob/5914ac3c6e9b3848023a7fa25e19e560b1c38541/hipamd/include/hip/amd_detail/amd_math_functions.h#L148C60-L148C60
> 
> Which is, in turn, implemented as calls to target specific builtins:
> 
> https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/46939af92ad91238c878a82aad2220822073ffa1/ockl/src/dots.cl#L124
> 
> For gfx1100, this lowers to __builtin_amdgcn_sudot4 builtin. 
> 
> 
> If you want, I can hack a compiler to lower the __builtin_amdgcn_sudot4 into int_amdgcn_sdot4 and find a way to run these tests.
> 
Probably worth mentioning is that I have been validating correctness using CK 8 bit and 16 bit test suite, which -- due to https://reviews.llvm.org/D155995 -- has many existing tests that lower into int_amdgcn_sdot4 for gfx1100.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158468/new/

https://reviews.llvm.org/D158468