[PATCH] D134433: [AMDGPU][GISel] Enable Matching of V2S16 G_BUILD_VECTOR

Thu Sep 29 09:13:40 PDT 2022

arsenm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:626-629
+  if (MRI->getType(Dst) != LLT::fixed_vector(2, 16) ||
+      (MI.getOpcode() == AMDGPU::G_BUILD_VECTOR_TRUNC &&
+       SrcTy != LLT::scalar(32)))
+    return selectImpl(MI, *CoverageInfo);
----------------
Theoretically the type checks should be unnecessary. If the legality rules are correct the verification should reject the illegally typed operations 

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:674-675
+
   // TODO: This should probably be a combine somewhere
-  // (build_vector_trunc $src0, undef -> copy $src0
+  // (build_vector $src0, undef)  -> copy $src0
   MachineInstr *Src1Def = getDefIgnoringCopies(Src1, *MRI);
----------------
Should look into dropping this next

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:684
+  // TODO: Can be improved?
+  if (IsVector) {
+    Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
----------------
Pierre-vh wrote:
> arsenm wrote:
> > D134463 switches to using v_perm_b32 here. Most everything in this function should be handled by tablegen though
> Do you mean that D134463 will make this code path obsolete/dead? Do I need to rebase on top of it?
> My understanding is that this won't interfere with D134463 as that diff adds tablegen patterns (which are matched above) and this code path only triggers when tablegen doesn't match
> 
> 
All of this code should really go through the same tablegen patterns. A future change should try to get rid of this custom code. The only custom selected case in the DAG path is for constants (and even that could be moved to tablegen)

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:729
-      .moreElementsIf(isSmallOddVector(0), oneMoreElement(0))
-      .clampMaxNumElements(0, S16, 2)
-      .clampScalar(0, S16, S64)
----------------
Pierre-vh wrote:
> arsenm wrote:
> > This shouldn't change
> > This shouldn't change
> 
> Without this, we get the following IR post-legalization:
> ````
>   %86:_(<2 x s16>) = G_FCANONICALIZE %126:_
>   %87:_(<2 x s16>) = G_FCANONICALIZE %124:_
>   %82:_(<2 x s16>) = G_FMAXNUM_IEEE %86:_, %87:_
>   %84:_(<2 x s16>) = G_FCANONICALIZE %127:_
>   %85:_(<2 x s16>) = G_FCANONICALIZE %125:_
>   %83:_(<2 x s16>) = G_FMAXNUM_IEEE %84:_, %85:_
>   %151:_(s32) = G_BITCAST %82:_(<2 x s16>)
>   %74:_(s16) = G_TRUNC %151:_(s32)
>   %152:_(s32) = G_LSHR %151:_, %135:_(s32)
>   %75:_(s16) = G_TRUNC %152:_(s32)
>   %153:_(s32) = G_BITCAST %83:_(<2 x s16>)
>   %76:_(s16) = G_TRUNC %153:_(s32)
>   %130:_(<2 x s16>) = G_BUILD_VECTOR %74:_(s16), %75:_(s16)
>   %131:_(<2 x s16>) = G_BUILD_VECTOR %76:_(s16), %40:_(s16)
>   %128:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %33:_(s16)
>   %129:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %40:_(s16)
>   %63:_(<2 x s16>) = G_FCANONICALIZE %130:_
>   %64:_(<2 x s16>) = G_FCANONICALIZE %128:_
>   %59:_(<2 x s16>) = G_FMINNUM_IEEE %63:_, %64:_
>   %61:_(<2 x s16>) = G_FCANONICALIZE %131:_
>   %62:_(<2 x s16>) = G_FCANONICALIZE %129:_
>   %60:_(<2 x s16>) = G_FMINNUM_IEEE %61:_, %62:_
> ``` 
> 
> in the following test:
> 
> ```
> ; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_postcvt:
> ; GCN: s_waitcnt
> ; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
> ; GFX900-DAG: v_mad_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
> ; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp
> 
> ; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
> ; GFX906-DAG: v_fma_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
> ; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp
> 
> ; GFX9: v_mov_b32_e32 v0, v{{[0-9]+}}
> ; GFX9-NEXT: s_setpc_b64
> define <3 x half> @v_mad_mix_v3f32_clamp_postcvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
>   %src0.ext = fpext <3 x half> %src0 to <3 x float>
>   %src1.ext = fpext <3 x half> %src1 to <3 x float>
>   %src2.ext = fpext <3 x half> %src2 to <3 x float>
>   %result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
>   %cvt.result = fptrunc <3 x float> %result to <3 x half>
>   %max = call <3 x half> @llvm.maxnum.v3f16(<3 x half> %cvt.result, <3 x half> zeroinitializer)
>   %clamp = call <3 x half> @llvm.minnum.v3f16(<3 x half> %max, <3 x half> <half 1.0, half 1.0, half 1.0>)
>   ret <3 x half> %clamp
> }
> ```
> 
> I can see three possibilities why:
> - The legalizer rule should go (easiest, it's why I did that for now)
> - The FPMinMadToClamp combine should be pre-legalizer instead of post-regbankcombiner 
> - This test is wrong/shouldn't fold
> 
> What do you prefer?
The goal is to produce the packed minnum/maxnum. The mess around it is for other combines 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134433/new/

https://reviews.llvm.org/D134433