[PATCH] D134433: [AMDGPU][GISel] Enable Matching of V2S16 G_BUILD_VECTOR

Mon Sep 26 07:44:47 PDT 2022

Pierre-vh marked an inline comment as done.
Pierre-vh added a comment.

For this to land, are there things I must migrate to TableGen or is the diff in good shape and just needs a bit of polishing ?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUCombine.td:91-95
+def const_build_vector_fold : GICombineRule<
+  (defs root:$buildvector, uint64_matchinfo:$matchinfo),
+  (match (wip_match_opcode G_BUILD_VECTOR, G_BUILD_VECTOR_TRUNC):$buildvector,
+         [{ return RegBankHelper.matchConstBuildVectorFold(*${buildvector}, ${matchinfo}); }]),
+  (apply [{ RegBankHelper.applyConstBuildVectorFold(*${buildvector}, ${matchinfo}); }])>;
----------------
arsenm wrote:
> This combine can be a separate patch
Actually it was a leftover from a previous attempt. I totally removed the combine now and refactored the InstructionSelection logic to fix the testcases the combine intended to fix. Seems more stable/cleaner.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:639-641
+  // First, before trying TableGen patterns, check if both sources are
+  // constants. In those cases, we can trivially compute the final constant
+  // and emit a simple move.
----------------
arsenm wrote:
> We could technically do this in tablegen. Not sure why this was manual in the DAG path
I think there are some annoying cases like a copy in-between the constant and the build_vector that make it annoying to handle in TableGen

Is it fine to leave this in Cpp or do I need to migrate this to TableGen for this to land? The diff is already quite large and I wanted to avoid moving things between Cpp/TableGen to avoid adding more complexity to it

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:684
+  // TODO: Can be improved?
+  if (IsVector) {
+    Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
----------------
arsenm wrote:
> D134463 switches to using v_perm_b32 here. Most everything in this function should be handled by tablegen though
Do you mean that D134463 will make this code path obsolete/dead? Do I need to rebase on top of it?
My understanding is that this won't interfere with D134463 as that diff adds tablegen patterns (which are matched above) and this code path only triggers when tablegen doesn't match

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:729
-      .moreElementsIf(isSmallOddVector(0), oneMoreElement(0))
-      .clampMaxNumElements(0, S16, 2)
-      .clampScalar(0, S16, S64)
----------------
arsenm wrote:
> This shouldn't change
> This shouldn't change

Without this, we get the following IR post-legalization:
````
  %86:_(<2 x s16>) = G_FCANONICALIZE %126:_
  %87:_(<2 x s16>) = G_FCANONICALIZE %124:_
  %82:_(<2 x s16>) = G_FMAXNUM_IEEE %86:_, %87:_
  %84:_(<2 x s16>) = G_FCANONICALIZE %127:_
  %85:_(<2 x s16>) = G_FCANONICALIZE %125:_
  %83:_(<2 x s16>) = G_FMAXNUM_IEEE %84:_, %85:_
  %151:_(s32) = G_BITCAST %82:_(<2 x s16>)
  %74:_(s16) = G_TRUNC %151:_(s32)
  %152:_(s32) = G_LSHR %151:_, %135:_(s32)
  %75:_(s16) = G_TRUNC %152:_(s32)
  %153:_(s32) = G_BITCAST %83:_(<2 x s16>)
  %76:_(s16) = G_TRUNC %153:_(s32)
  %130:_(<2 x s16>) = G_BUILD_VECTOR %74:_(s16), %75:_(s16)
  %131:_(<2 x s16>) = G_BUILD_VECTOR %76:_(s16), %40:_(s16)
  %128:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %33:_(s16)
  %129:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %40:_(s16)
  %63:_(<2 x s16>) = G_FCANONICALIZE %130:_
  %64:_(<2 x s16>) = G_FCANONICALIZE %128:_
  %59:_(<2 x s16>) = G_FMINNUM_IEEE %63:_, %64:_
  %61:_(<2 x s16>) = G_FCANONICALIZE %131:_
  %62:_(<2 x s16>) = G_FCANONICALIZE %129:_
  %60:_(<2 x s16>) = G_FMINNUM_IEEE %61:_, %62:_
``` 

in the following test:

```
; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_postcvt:
; GCN: s_waitcnt
; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
; GFX900-DAG: v_mad_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
; GFX906-DAG: v_fma_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

; GFX9: v_mov_b32_e32 v0, v{{[0-9]+}}
; GFX9-NEXT: s_setpc_b64
define <3 x half> @v_mad_mix_v3f32_clamp_postcvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
  %src0.ext = fpext <3 x half> %src0 to <3 x float>
  %src1.ext = fpext <3 x half> %src1 to <3 x float>
  %src2.ext = fpext <3 x half> %src2 to <3 x float>
  %result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
  %cvt.result = fptrunc <3 x float> %result to <3 x half>
  %max = call <3 x half> @llvm.maxnum.v3f16(<3 x half> %cvt.result, <3 x half> zeroinitializer)
  %clamp = call <3 x half> @llvm.minnum.v3f16(<3 x half> %max, <3 x half> <half 1.0, half 1.0, half 1.0>)
  ret <3 x half> %clamp
}
```

I can see three possibilities why:
- The legalizer rule should go (easiest, it's why I did that for now)
- The FPMinMadToClamp combine should be pre-legalizer instead of post-regbankcombiner 
- This test is wrong/shouldn't fold

What do you prefer?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134433/new/

https://reviews.llvm.org/D134433