[PATCH] D107650: GlobalISel[RFC]: Avoid use of G_INSERT and G_EXTRACT in Legalizer

Fri Aug 6 08:13:09 PDT 2021

Petar.Avramovic created this revision.
Petar.Avramovic added reviewers: foad, arsenm, piotr, mbrkusanin, paquette, aemerson, sebastian-ne.
Herald added subscribers: ctetreau, dexonsmith, kerbowa, hiraditya, tpr, rovka, nhaehnle, jvesely.
Petar.Avramovic requested review of this revision.
Herald added subscribers: llvm-commits, wdng.
Herald added a project: LLVM.

The general approach is to rely on G_UNMERGE_VALUES and the existing artifact combines: unmerge vectors into individual elements, and avoid both unmerging into subvectors (the getLCMTy approach) and using G_INSERT or G_EXTRACT.

List of changes:
moreElementsVectorDst, moreElementsVectorSrc: first unmerge the input into individual elements; Dst then builds a vector that leaves out the trailing elements, while Src builds a vector padded with undef elements. Here I use the new artifacts for this instead, since they are shorter and easier to read.
CallLowering: argument lowering no longer uses the LCMTy style and uses the new artifacts instead (I think it should be possible to refactor this to use moreElementsVectorDst/moreElementsVectorSrc, which would make this part of the code more consistent).
LegalizerHelper/AMDGPULegalizerInfo: lower G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_SHUFFLE_VECTOR using unmerge/build_vector. Vector loads are lowered into a power-of-2-sized load plus a load of the remaining elements (the element size must be a power of 2); this no longer uses moreElements/fewerElements.
AMDGPURegisterBankInfo: use unmerge in load lowering.
LegalizationArtifactCombiner: G_DELETE_TRAILING_VECTOR_ELTS and G_PAD_VECTOR_WITH_UNDEF_ELTS are combined into copies.
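The element-level helpers described above can be modeled in plain Python by treating a vector as a tuple of lane values. This is a sketch under stated assumptions, not the LLVM API: `UNDEF`, `more_elements_dst`, `more_elements_src` and `split_load_elts` are hypothetical names, and the load split assumes the power-of-2 part is chosen by element count (equivalent to a split by size when the element size is a power of 2).

```python
UNDEF = "undef"  # hypothetical marker for an undef lane


def unmerge_to_elements(vec):
    """Model of G_UNMERGE_VALUES into scalars: one value per lane."""
    return list(vec)


def build_vector(elts):
    """Model of G_BUILD_VECTOR: assemble lanes into a vector."""
    return tuple(elts)


def more_elements_dst(wide_result, narrow_n):
    """moreElementsVectorDst sketch: the instruction now defines a wider
    vector; the original result is rebuilt from its leading lanes."""
    elts = unmerge_to_elements(wide_result)
    return build_vector(elts[:narrow_n])  # leave out the trailing lanes


def more_elements_src(narrow_src, wide_n):
    """moreElementsVectorSrc sketch: the operand is rebuilt as a wider
    vector padded with undef lanes."""
    elts = unmerge_to_elements(narrow_src)
    return build_vector(elts + [UNDEF] * (wide_n - len(elts)))


def split_load_elts(num_elts):
    """Split a vector load into a power-of-2 number of elements plus the
    remaining elements (assumed split rule, by element count)."""
    p2 = 1 << (num_elts.bit_length() - 1)  # largest power of 2 <= num_elts
    return p2, num_elts - p2


print(more_elements_src(("a0", "a1", "a2"), 4))  # ('a0', 'a1', 'a2', 'undef')
print(more_elements_dst(("r0", "r1", "r2", "r3"), 3))  # ('r0', 'r1', 'r2')
print(split_load_elts(3), split_load_elts(7))  # (2, 1) (4, 3)
```

In this model a <3 x s32> load becomes a <2 x s32> load plus an s32 load; whether the remainder is split further is left to the rest of legalization.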

Tested on an internal test suite; this fixes vector-related legalizer failures (see vector-legalizer.ll for some examples).
MIR tests require changes since they still use the old argument lowering approach. There is a hack in the legalizer for this that detects infinite loops (I did not use it during testing).

Reasoning for new artifacts:
For example, AMDGPU wants to do fewer_vector_elements from <3 x s16> to <2 x s16>, but first it has to do more_vector_elements to a multiple of <2 x s16> (<4 x s16>).
The getLCMTy approach is not combiner-friendly since it produces:

%LCMTy(<12 x s16>) = G_CONCAT_VECTORS %a(<3 x s16>), %undef0(<3 x s16>), %undef1(<3 x s16>), %undef2(<3 x s16>)
%b(<4 x s16>), %(<4 x s16>), %undef1(<4 x s16>) = G_UNMERGE_VALUES %LCMTy(<12 x s16>)

Here %b takes some elements from %a and some from %undef0, but the combiner has no way to reference those elements (they are not 'named' by a vreg). Its best option is to extract_vector_elt or unmerge %a and %undef0 and use build_vector for %b, but this creates more artifacts that cannot be combined and are not legal, which may result in infinite loops. It is also an extra step compared to the proposal in this patch.
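The lane bookkeeping of the getLCMTy sequence can be computed directly to show why %b has no single named source. The sketch below uses a hypothetical helper `lcm_pieces` (not an LLVM function) that maps each lane of every unmerged piece back to the (source piece, lane) it reads, where source piece 0 is %a and the remaining pieces are the undef padding.

```python
from math import gcd


def lcm_pieces(src_elts, dst_elts):
    """For the getLCMTy-style sequence, return the number of concatenated
    source pieces and, for each unmerged dst piece, the (piece, lane) pair
    each of its lanes reads. Piece 0 is the original value; others are undef."""
    total = src_elts * dst_elts // gcd(src_elts, dst_elts)  # LCM lane count
    num_src_pieces = total // src_elts
    dst_pieces = [
        [divmod(d * dst_elts + i, src_elts) for i in range(dst_elts)]
        for d in range(total // dst_elts)
    ]
    return num_src_pieces, dst_pieces


n_src, pieces = lcm_pieces(3, 4)  # <3 x s16> widened towards <4 x s16>
print(n_src, len(pieces))  # 4 3
print(pieces[0])  # [(0, 0), (0, 1), (0, 2), (1, 0)]
```

The first piece (%b) reads lanes 0-2 of %a and lane 0 of %undef0, so its contents straddle two vregs, which is exactly what the combiner cannot express without new per-element artifacts.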

What trivially works is to unmerge/build_vector each vector element (avoiding unmerges into subvectors).

Since we want to do fewer_elements to <2 x s16>, <3 x s16> will be handled as two <2 x s16> values (%x0 and %x1), where we don't care about element %x1[1].

%a is defined something like this:

  %a_big(<4 x s16>) = G_CONCAT_VECTORS %x0(<2 x s16>), %x1(<2 x s16>)
  %a_big0(s16), %a_big1(s16), %a_big2(s16), %a_big3(s16) = G_UNMERGE_VALUES %a_big(<4 x s16>)
  %a(<3 x s16>) = G_BUILD_VECTOR %a_big0(s16), %a_big1(s16), %a_big2(s16)

<3 x s16> -> <4 x s16>

  %a0(s16), %a1(s16), %a2(s16) = G_UNMERGE_VALUES %a(<3 x s16>)
  %undef(s16) = G_IMPLICIT_DEF
  %b(<4 x s16>) = G_BUILD_VECTOR %a0(s16), %a1(s16), %a2(s16), %undef(s16)

<4 x s16> -> <2 x s16>

  %b0(s16), %b1(s16), %b2(s16), %b3(s16) = G_UNMERGE_VALUES %b(<4 x s16>)
  %c0(<2 x s16>) = G_BUILD_VECTOR %b0(s16), %b1(s16)
  %c1(<2 x s16>) = G_BUILD_VECTOR %b2(s16), %b3(s16)
  %c(<4 x s16>) = G_CONCAT_VECTORS %c0(<2 x s16>), %c1(<2 x s16>)

Some user of %c will then have to unmerge it:

  %c0(<2 x s16>), %c1(<2 x s16>) = G_UNMERGE_VALUES %c(<4 x s16>)

The combiner is able to figure out that %c0 = %x0.
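The element-wise chain above can be traced with a small Python model in which each lane value is labeled by where it came from. The helpers and the names `d0`/`d1` (standing for the two results of the final G_UNMERGE_VALUES) are hypothetical; the sketch only follows values through the chain and does not implement the LLVM combiner.

```python
UNDEF = "undef"  # hypothetical marker for an undef lane


def unmerge_elts(vec):
    """G_UNMERGE_VALUES into scalars: one value per lane."""
    return list(vec)


def build_vector(*elts):
    return tuple(elts)


def concat_vectors(*vecs):
    return tuple(e for v in vecs for e in v)


def unmerge_subvectors(vec, n):
    """G_UNMERGE_VALUES into n-lane pieces (only applied to a concat here)."""
    return [vec[i:i + n] for i in range(0, len(vec), n)]


def same_up_to_undef(v, ref):
    """True if v agrees with ref on every lane that is not undef."""
    return len(v) == len(ref) and all(e == UNDEF or e == r for e, r in zip(v, ref))


# <3 x s16> is carried as two <2 x s16> values; x1[1] is don't-care.
x0 = ("x0[0]", "x0[1]")
x1 = ("x1[0]", "x1[1]")

# %a(<3 x s16>): unmerge the <4 x s16> concat and rebuild from 3 lanes.
a_big0, a_big1, a_big2, _ = unmerge_elts(concat_vectors(x0, x1))
a = build_vector(a_big0, a_big1, a_big2)

# <3 x s16> -> <4 x s16>: unmerge to elements, pad with undef.
a0, a1, a2 = unmerge_elts(a)
b = build_vector(a0, a1, a2, UNDEF)

# <4 x s16> -> <2 x s16>: unmerge to elements, rebuild pairs, concat.
b0, b1, b2, b3 = unmerge_elts(b)
c = concat_vectors(build_vector(b0, b1), build_vector(b2, b3))

# The user unmerges %c; following the values leads back to %x0 and %x1.
d0, d1 = unmerge_subvectors(c, 2)
print(d0 == x0, same_up_to_undef(d1, x1))  # True True
```

Every step in this model materializes one value per lane before rebuilding, which mirrors the per-element vregs the next paragraph objects to.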

All of these combines are already available, but the problem is that each step has to allocate 2 x vector_size vregs that will be combined away.
I propose adding two artifacts that are trivially combined, G_DELETE_TRAILING_VECTOR_ELTS and G_PAD_VECTOR_WITH_UNDEF_ELTS:

  %a_big(<4 x s16>) = G_CONCAT_VECTORS %x0(<2 x s16>), %x1(<2 x s16>)
  %a(<3 x s16>) = G_DELETE_TRAILING_VECTOR_ELTS %a_big(<4 x s16>)

<3 x s16> -> <4 x s16>

  %b(<4 x s16>) = G_PAD_VECTOR_WITH_UNDEF_ELTS %a(<3 x s16>)

<4 x s16> -> <2 x s16>

  %c0(<2 x s16>), %c1(<2 x s16>) = G_UNMERGE_VALUES %b(<4 x s16>)
  %c(<4 x s16>) = G_CONCAT_VECTORS %c0(<2 x s16>), %c1(<2 x s16>)
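The "combine to copy" rule for the two proposed artifacts can be sketched as follows. This is a hypothetical Python model (`delete_trailing`, `pad_with_undef` and `fold_pair` are illustrative names, not the patch's C++ code), assuming the fold fires only when the outer result has as many lanes as the original source.

```python
UNDEF = "undef"  # hypothetical marker for an undef lane


def delete_trailing(vec, n):
    """Model of G_DELETE_TRAILING_VECTOR_ELTS: keep the first n lanes."""
    assert n <= len(vec)
    return tuple(vec[:n])


def pad_with_undef(vec, n):
    """Model of G_PAD_VECTOR_WITH_UNDEF_ELTS: append undef lanes up to n."""
    assert n >= len(vec)
    return tuple(vec) + (UNDEF,) * (n - len(vec))


def fold_pair(outer_op, inner_op, src, n):
    """Sketch of the copy combine: pad(delete(src)) or delete(pad(src))
    producing n == len(src) lanes folds to src. For pad(delete(src)) this
    is valid because src's lanes are an allowed value for the undef lanes."""
    if {outer_op, inner_op} == {"pad", "delete"} and n == len(src):
        return src
    return None  # not foldable in this model


x0 = ("x0[0]", "x0[1]")
x1 = ("x1[0]", "x1[1]")
a_big = x0 + x1                # %a_big = G_CONCAT_VECTORS %x0, %x1
a = delete_trailing(a_big, 3)  # %a(<3 x s16>)
b = pad_with_undef(a, 4)       # %b(<4 x s16>), lane 3 is undef

folded = fold_pair("pad", "delete", a_big, 4)  # %b becomes a copy of %a_big
refines = all(e == UNDEF or e == f for e, f in zip(b, folded))

# Unmerging %a_big = concat(%x0, %x1) then forwards %x0 and %x1 directly.
c0, c1 = folded[:2], folded[2:]
print(refines, c0 == x0, c1 == x1)  # True True True
```

In this model each widening or narrowing step is a single instruction with no per-lane values, and the pair folds away before the existing unmerge-of-concat combine runs.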

https://reviews.llvm.org/D107650

Files:
  llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
  llvm/include/llvm/CodeGen/GlobalISel/MIPatternMatch.h
  llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
  llvm/include/llvm/CodeGen/GlobalISel/Utils.h
  llvm/include/llvm/Support/TargetOpcodes.def
  llvm/include/llvm/Target/GenericOpcodes.td
  llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
  llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
  llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
  llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
  llvm/lib/CodeGen/GlobalISel/Utils.cpp
  llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
  llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
  llvm/test/CodeGen/AArch64/GlobalISel/call-lowering-vectors.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement-stack-lower.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/function-returns.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.i16.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/insertelement.large.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-extract.mir
  llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-fract.f64.mir
  llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-insert.mir
  llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-insert.xfail.mir
  llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-call-return-values.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-call.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-sibling-call.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-llvm.amdgcn.image.load.2d.d16.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-llvm.amdgcn.image.store.2d.d16.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.image.load.1d.d16.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/load-constant.96.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/store-local.128.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/store-local.96.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/vector-legalizer-after-legalizer.ll
  llvm/test/CodeGen/AMDGPU/GlobalISel/vector-legalizer.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D107650.364800.patch
Type: text/x-patch
Size: 415083 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210806/d8868597/attachment-0001.bin>