[PATCH] D134433: [AMDGPU][GISel] Enable Matching of V2S16 G_BUILD_VECTOR

Thu Sep 29 09:31:43 PDT 2022

arsenm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp:684
+  // TODO: Can be improved?
+  if (IsVector) {
+    Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
----------------
Pierre-vh wrote:
> arsenm wrote:
> > Pierre-vh wrote:
> > > arsenm wrote:
> > > > D134463 switches to using v_perm_b32 here. Most everything in this function should be handled by tablegen though
> > > Do you mean that D134463 will make this code path obsolete/dead? Do I need to rebase on top of it?
> > > My understanding is that this won't interfere with D134463 as that diff adds tablegen patterns (which are matched above) and this code path only triggers when tablegen doesn't match
> > > 
> > > 
> > All of this code should really go through the same tablegen patterns. A future change should try to get rid of this custom code. The only custom selected case in the DAG path is for constants (and even that could be moved to tablegen)
> Is it fine to keep as-is for now, or do I need to move it all to tablegen for this to land?
> I don't mind going back to it later but currently there's a lot of diffs open for mad_mix and it's becoming more and more difficult to differentiate essential changes from "nice improvements". If the current version is acceptable I'd rather keep it for now, is that ok?
Yes, moving to tablegen is a separate change for later

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:729
-      .moreElementsIf(isSmallOddVector(0), oneMoreElement(0))
-      .clampMaxNumElements(0, S16, 2)
-      .clampScalar(0, S16, S64)
----------------
Pierre-vh wrote:
> arsenm wrote:
> > Pierre-vh wrote:
> > > arsenm wrote:
> > > > This shouldn't change
> > > > This shouldn't change
> > > 
> > > Without this, we get the following IR post-legalization:
> > > ````
> > >   %86:_(<2 x s16>) = G_FCANONICALIZE %126:_
> > >   %87:_(<2 x s16>) = G_FCANONICALIZE %124:_
> > >   %82:_(<2 x s16>) = G_FMAXNUM_IEEE %86:_, %87:_
> > >   %84:_(<2 x s16>) = G_FCANONICALIZE %127:_
> > >   %85:_(<2 x s16>) = G_FCANONICALIZE %125:_
> > >   %83:_(<2 x s16>) = G_FMAXNUM_IEEE %84:_, %85:_
> > >   %151:_(s32) = G_BITCAST %82:_(<2 x s16>)
> > >   %74:_(s16) = G_TRUNC %151:_(s32)
> > >   %152:_(s32) = G_LSHR %151:_, %135:_(s32)
> > >   %75:_(s16) = G_TRUNC %152:_(s32)
> > >   %153:_(s32) = G_BITCAST %83:_(<2 x s16>)
> > >   %76:_(s16) = G_TRUNC %153:_(s32)
> > >   %130:_(<2 x s16>) = G_BUILD_VECTOR %74:_(s16), %75:_(s16)
> > >   %131:_(<2 x s16>) = G_BUILD_VECTOR %76:_(s16), %40:_(s16)
> > >   %128:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %33:_(s16)
> > >   %129:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %40:_(s16)
> > >   %63:_(<2 x s16>) = G_FCANONICALIZE %130:_
> > >   %64:_(<2 x s16>) = G_FCANONICALIZE %128:_
> > >   %59:_(<2 x s16>) = G_FMINNUM_IEEE %63:_, %64:_
> > >   %61:_(<2 x s16>) = G_FCANONICALIZE %131:_
> > >   %62:_(<2 x s16>) = G_FCANONICALIZE %129:_
> > >   %60:_(<2 x s16>) = G_FMINNUM_IEEE %61:_, %62:_
> > > ``` 
> > > 
> > > in the following test:
> > > 
> > > ```
> > > ; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_postcvt:
> > > ; GCN: s_waitcnt
> > > ; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
> > > ; GFX900-DAG: v_mad_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
> > > ; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp
> > > 
> > > ; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
> > > ; GFX906-DAG: v_fma_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
> > > ; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp
> > > 
> > > ; GFX9: v_mov_b32_e32 v0, v{{[0-9]+}}
> > > ; GFX9-NEXT: s_setpc_b64
> > > define <3 x half> @v_mad_mix_v3f32_clamp_postcvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
> > >   %src0.ext = fpext <3 x half> %src0 to <3 x float>
> > >   %src1.ext = fpext <3 x half> %src1 to <3 x float>
> > >   %src2.ext = fpext <3 x half> %src2 to <3 x float>
> > >   %result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
> > >   %cvt.result = fptrunc <3 x float> %result to <3 x half>
> > >   %max = call <3 x half> @llvm.maxnum.v3f16(<3 x half> %cvt.result, <3 x half> zeroinitializer)
> > >   %clamp = call <3 x half> @llvm.minnum.v3f16(<3 x half> %max, <3 x half> <half 1.0, half 1.0, half 1.0>)
> > >   ret <3 x half> %clamp
> > > }
> > > ```
> > > 
> > > I can see three possibilities why:
> > > - The legalizer rule should go (easiest, it's why I did that for now)
> > > - The FPMinMadToClamp combine should be pre-legalizer instead of post-regbankcombiner 
> > > - This test is wrong/shouldn't fold
> > > 
> > > What do you prefer?
> > The goal is to produce the packed minnum/maxnum. The mess around it is for other combines 
> I'm not sure I understand, is the legalizer change wrong and I need to look into adding more combines to remove the extra instructions?
> What kind of combine can be done there? I don't see anything obvious
<2 x s16> minnum/maxnum are legal with VOP3P and the legalizer rules should express this and try to legalize wider vectors to <2 x s16> pieces. The MIR here does correctly produce the <2 x s16> operations. The additional context in this particular test isn't relevant to what the legalizer rules here should be (I'm not seeing what the problem is here, other than the legalized MIR has some vector conversion mess)

G_BUILD_VECTOR (TRUNC (BITCAST x)), (TRUNC (LSHR (BITCAST x), 16)) is an identity that can fold out

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134433/new/

https://reviews.llvm.org/D134433