[PATCH] D134870: [AMDGPU][GISel] Better support for V2S16 G_EXTRACT/INSERT_VECTOR_ELT

Thu Sep 29 14:50:48 PDT 2022

arsenm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp:160

+bool AMDGPUPreLegalizerCombinerHelper::matchV2S16ExtractInsertVectorEltFold(
+    MachineInstr &MI, unsigned &Idx) {
----------------
Pierre-vh wrote:
> Pierre-vh wrote:
> > arsenm wrote:
> > > This combine is too targeted (and I'm pretty sure it's redundant). The generic combiner already turns G_INSERT_VECTOR_ELT with constant chains into G_BUILD_VECTOR. 
> > This combine is the most important of the two though
> > I see that there's indeed a matchCombineInsertVecElts/matchExtractVecEltBuildVec but it didn't work at all in the mad_mix cases, I think it's because the source isn't a BUILD_VECTOR in those cases but instead it's operations such as `G_FPTRUNC <2 x 16>` so the combine isn't redundant as far as I can see, it's just more focused on V2S16s
> I'd really like to keep this combine because it makes D134354 work just as well without needing to change the extract/insert vector element legalization rules
> 
> Perhaps I can add some code to check if the operation that sources the vector is illegal, and only do it when the source operation isn't legal (with the vector type)?
> I could maybe even make the combine less targeted, not 2x16 specific but let it work on any operation as long as its source is illegal (and probably going to be scalarized) ? Would that be better ?
It's more important to make progress on the vector handling strategy than trying to get mad_mix to work immediately. Most of these should work for general types, not just this one case.

matchCombineInsertVecElts is primarily looking for G_BUILD_VECTOR sources that it created itself, so the source not being a build_vector should not be an issue. It also looks weirdly limited; for example the loop seems to not accept undef elements. 

I've also been thinking it's likely a mistake to be doing vector combines in terms of the merge/unmerge artifacts. Vector operation combines should probably aim to use other vector operations, and leave the merge/unmerge for the legalizer

Can you compare to what happens in the DAG? It looks to me like <2 x i16> insertelements are turned into shuffles by combineInsertEltToShuffle, which later lower into a build_vector of the extracts of the individual elements

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp:209
+    assert(MI.getOpcode() == AMDGPU::G_EXTRACT_VECTOR_ELT);
+    Register Elt = B.buildUnmerge(EltTy, VecSrc).getReg(Idx);
+
----------------
Pierre-vh wrote:
> arsenm wrote:
> > We don't really want to have 16-bit unmerges. I'm not sure how this combine helps you with the current problem
> It seems that in all of those cases, the unmerge ends up removed because the source value gets scalarized.
> e.g. we have `G_UNMERGE_VALUE (v2s16 (G_FPTRUNC ...))`  but the trunc gets split into two scalar operations, removing the unmerge.
> Why can't we have 16 bit unmerges? Won't the legalizer deal with them if they shouldn't be there?
Because we don't  really have 16-bit registers. An extract of 16-bit values implies a cast and shift of some kind. As I mentioned above, I also think using unmerge outside of the legalizer should probably be avoided

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134870/new/

https://reviews.llvm.org/D134870