[PATCH] D134870: [AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT

Mon Oct 3 01:08:20 PDT 2022

Pierre-vh added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp:160

+bool AMDGPUPreLegalizerCombinerHelper::matchV2S16ExtractInsertVectorEltFold(
+    MachineInstr &MI, unsigned &Idx) {
----------------
Pierre-vh wrote:
> arsenm wrote:
> > arsenm wrote:
> > > Pierre-vh wrote:
> > > > arsenm wrote:
> > > > > Pierre-vh wrote:
> > > > > > Pierre-vh wrote:
> > > > > > > arsenm wrote:
> > > > > > > > This combine is too targeted (and I'm pretty sure it's redundant). The generic combiner already turns G_INSERT_VECTOR_ELT with constant chains into G_BUILD_VECTOR. 
> > > > > > > This combine is the most important of the two though
> > > > > > > I see that there's indeed a matchCombineInsertVecElts/matchExtractVecEltBuildVec but it didn't work at all in the mad_mix cases, I think it's because the source isn't a BUILD_VECTOR in those cases but instead it's operations such as `G_FPTRUNC <2 x 16>` so the combine isn't redundant as far as I can see, it's just more focused on V2S16s
> > > > > > I'd really like to keep this combine because it makes D134354 work just as well without needing to change the extract/insert vector element legalization rules
> > > > > > 
> > > > > > Perhaps I can add some code to check if the operation that sources the vector is illegal, and only do it when the source operation isn't legal (with the vector type)?
> > > > > > I could maybe even make the combine less targeted, not 2x16 specific but let it work on any operation as long as its source is illegal (and probably going to be scalarized) ? Would that be better ?
> > > > > It's more important to make progress on the vector handling strategy than trying to get mad_mix to work immediately. Most of these should work for general types, not just this one case.
> > > > > 
> > > > > matchCombineInsertVecElts is primarily looking for G_BUILD_VECTOR sources that it created itself, so the source not being a build_vector should not be an issue. It also looks weirdly limited; for example the loop seems to not accept undef elements. 
> > > > > 
> > > > > I've also been thinking it's likely a mistake to be doing vector combines in terms of the merge/unmerge artifacts. Vector operation combines should probably aim to use other vector operations, and leave the merge/unmerge for the legalizer
> > > > > 
> > > > > Can you compare to what happens in the DAG? It looks to me like <2 x i16> insertelements are turned into shuffles by combineInsertEltToShuffle, which later lower into a build_vector of the extracts of the individual elements
> > > > In the DAG: `insert_vector_elt -> vector_shuffle<2,1> -> BUILD_VECTOR # D:1 t69, t44`
> > > > So what we need in the end is a BUILD_VECTOR for the matching to be done.
> > > > 
> > > > What would be a good way forward then if we remove this combine?
> > > >  - Leave the regressions in for now
> > > >  - Add some combine to match `G_BITCAST (G_OR (G_AND ..., ...), ...)` into `G_BUILD_VECTOR`? (aka undoing the legalizer's transform, not especially a fan of that)
> > > >  - Move this combine to the custom legalizer: `customIf V2S16`+ add a code path to handle V2S16 manually.
> > > Without a compelling reason to do something else, I'm inclined to start by following what the DAG is doing. That would mean:
> > > 
> > > 
> > >   # Adding a combine for insert_vector_elt -> shufflevector
> > > 
> > >   # Teaching the default lowering for insert_vector_elt to use a shuffle for a constant index before trying to go to the stack
> > > 
> > > 
> > > 
> > > 
> > Probably also worth porting all the low hanging fruit in visitINSERT_VECTOR_ELT/visitEXTRACT_VECTOR_ELT
> The current DAG combine does it only if the inserted element is a bitcast of a subvector, no?
> ``` 
>   // insert_vector_elt V, (bitcast X from vector type), IdxC -->
>   // bitcast(shuffle (bitcast V), (extended X), Mask)
> ```
> 
> In one of the regressions (caused by reverting this patch) we get the following gMIR:
> 
> ```
> # *** IR Dump Before Legalizer (legalizer) ***:
> # Machine code for function v_mad_mix_v2f32_clamp_postcvt_lo: IsSSA, TracksLiveness
> 
> bb.1 (%ir-block.0):
>   liveins: $vgpr0, $vgpr1, $vgpr2
>   %0:_(<2 x s16>) = COPY $vgpr0
>   %1:_(<2 x s16>) = COPY $vgpr1
>   %2:_(<2 x s16>) = COPY $vgpr2
>   %3:_(<2 x s32>) = G_FPEXT %0:_(<2 x s16>)
>   %4:_(<2 x s32>) = G_FPEXT %1:_(<2 x s16>)
>   %5:_(<2 x s32>) = G_FPEXT %2:_(<2 x s16>)
>   %6:_(<2 x s32>) = G_FMA %3:_, %4:_, %5:_
>   %7:_(<2 x s16>) = G_FPTRUNC %6:_(<2 x s32>)
>   %9:_(s32) = G_CONSTANT i32 0
>   %8:_(s16) = G_EXTRACT_VECTOR_ELT %7:_(<2 x s16>), %9:_(s32)
>   %10:_(s16) = G_FCONSTANT half 0xH0000
>   %11:_(s16) = G_FMAXNUM %8:_, %10:_
>   %12:_(s16) = G_FCONSTANT half 0xH3C00
>   %13:_(s16) = G_FMINNUM %11:_, %12:_
>   %14:_(<2 x s16>) = G_INSERT_VECTOR_ELT %7:_, %13:_(s16), %9:_(s32)
>   $vgpr0 = COPY %14:_(<2 x s16>)
>   SI_RETURN implicit $vgpr0
> ```
> 
> Ideally we need to get rid of both insert/extract vector elt and/or match them to BUILD_VECTOR.
> The first extract should be easy enough, maybe worst case I'll need to add some new lowering logic to use bitcast + trunc for elt 0.
> 
> For the second case, how would lowering/combining look like? Do I do a BUILD_VECTOR from %13 & a undef, and then shuffle the result?
> Do you think this is more fitting for a new custom lowering for INSERT_VECTOR_ELT (constrained to V2S16?) or for a pre-legal combine?
> 
> 
Did a "quick and dirty" implementation in our `legalizeInsertVectorElt`  (+ `customFor({V2S16, S16})`, and it seems to work like a charm for mad-mix tests (other tests have crashes because it's hacked-in for this specific case, ofc) - no regressions, only improvements (without this patch and on top of D134967). 

```lang=cpp
    const auto Undef = MRI.createGenericVirtualRegister(EltTy);
    B.buildUndef(Undef);

    const auto OtherVec = MRI.createGenericVirtualRegister(VecTy);
    B.buildBuildVector(OtherVec, { Ins, Undef });

    // 2 == Ins in OtherVec
    // 0/1 = Original Elts
    if(IdxVal == 0) {
      B.buildShuffleVector(Dst, Vec, OtherVec, {2, 1});
    } else {
      assert(IdxVal == 1);
      B.buildShuffleVector(Dst, Vec, OtherVec, {0, 2});
    }
```

Would it be fire to just implement it as custom lowering? Or should it be a pre-legalizer combine?

Also, what should the heuristics be for this transformation to apply (on top of the index being constant). Should we only do it for small vectors (<2 elt?), V2S16 only? If we do it all the time it'll just replace the current custom legalization (not sure it's what we want)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134870/new/

https://reviews.llvm.org/D134870