[PATCH] D61787: [GlobalISel Legalizer] Improve artifact combiner

Fri May 24 05:25:27 PDT 2019

Petar.Avramovic marked 17 inline comments as done.
Petar.Avramovic added a comment.

'crash' = unable-to-legalize error/warning/nothing depending on -global-isel-abort option.

> It looks like a lot of these test changes result in illegal MIR getting out of the legalizer pass.

That is already the case. Mentioned tests already used to crash but now they always crash on instruction that is root of problem (here it is most of the times G_TRUNC for vector) . 
About mentioned legalization rules violations:
Tests that finish legalization successfully are unchanged (except one where we perform additional G_UNMERGE/G_MERGE combine). 
"Violations" happen for Artifacts in tests where mir output was illegal anyway.
That it Artifacts are either waiting for it uses to be resolved first or are about to be legalized. It just happened that there was some "artifact"/"auxiliary artifact"/instruction that was not legal. 
 @arsenm Do you have any comments?

> what would be the main benefit to making this change? I get the impression it's about avoiding implementing all the legalization rules but while I can see that it postpones the need to implement some, I'm not convinced it reduces the amount rules that need defining.

The primary goal of this patch is adding support for combining of chained artifacts.

About side effects/benefits:
Remove some of the rules for artifacts. The ones that say artifact is legal for a type but we know it is not selectable for that type.

Suppose that target cannot select `s128 = G_MERGE s32 s32 s32 s32` nor  `s32 s32 s32 s32 = G_UNMERGE s128`. With current artifact combiner we could encounter following situations:

1. G_UNMERGE combines with G_MERGE, that was easy.
2. There was no G_UNMERGE, that could combine with G_MERGE. Then let's legalize G_MERGE, and hope for the best. G_MERGE is now legal, removed from all GlobalISel observer lists and waits for legalization to finish so that it can be regbankselected. During legalization of some other instruction G_UNMERGE appears and combines away G_MERGE. This was fortunate sequence of events as we couldn't select this G_MERGE if it "escaped" from Legalizer. But in LegalizerInfo it says that G_MERGE is legal for {s128, s32} despite the fact that we cannot select it. Why? Because that is how current Artifact combiner works.
3. G_UNMERGE couldn't find G_MERGE to combine with. G_UNMERGE is "Artifact" nothing can combine it away, then let's make it legal. G_UNMERGE is now a legalized instruction that is no more in ArtifactList and waits legalizer to end so that it can crash in Instruction Select. But there is more: other instruction get legalized (e.g. G_SEXT_INREG get narrowscalar-ed with G_UNMERGE/G_MERGE) and the G_MERGE appears. G_UNMERGE could combine with this G_MERGE but it is not in ArtifactList and we end up with uncombined Artifacts.

Artifact combiner at the moment does 1 and 2. With this change Artifact combiner does 1,2 and 3(meaning it can perform this combine) + combines of type 2. are performed without requirement to have Artifacts legal for types that are not Instr Selectable ( in example above: {s128, s32} for G_MERGE and {s32, s128} for G_UNMERGE). 
By the way 3. is the reason D61289 <https://reviews.llvm.org/D61289> can't have mir test for narrow scalar of G_SEXT_INREG, with expected optimal output.
With this change
Target that has only s32 as legal type (everything else is narrow/widen scalar-ed to s32) is free to declare all artifacts as unsupported as a last rule. This is it could have to narrowScalar G_ANYEXT to s32 but there is no need to declare G_ANYEXT as legal for any types.
We could argue that this is not that beneficial by itself, but since it is a consequence of allowing combine of chained artifacts, I would say it makes legalization rules for artifacts much more straightforward.

> Legalization should be able to happen in any order (i.e. processing order is purely optimization and not correctness)

Instruction legalization can already be done in any order, but artifact combine has to wait for all instructions to be legalized first, right?
Now we allow for chained artifacts to wait for another artifact/instruction to be handled first by introducing TryLaterArtifactList. To me, this feels like a natural extension of existing algorithm.

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/artifact-combiner-sext.mir:56

 ---
 name: test_sext_trunc_v2s32_to_v2s8_to_v2s16
----------------
dsanders wrote:
> Petar.Avramovic wrote:
> >  We start with following MIR:
> > ```
> >   %0:_(<2 x s32>) = COPY $vgpr0_vgpr1
> >   %1:_(<2 x s8>) = G_TRUNC %0
> >   %2:_(<2 x s16>) = G_SEXT %1
> >   $vgpr0 = COPY %2
> > ```
> >  `%2:_(<2 x s16>) = G_SEXT %1` 
> >  is combined, as a result we are left with:
> > ```
> >   %0:_(<2 x s32>) = COPY $vgpr0_vgpr1
> >   %3:_(s16) = G_CONSTANT i16 8
> >   %4:_(<2 x s16>) = G_BUILD_VECTOR %3:_(s16), %3:_(s16)
> >   %5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)
> >   %6:_(<2 x s16>) = G_SHL %5:_, %4:_(<2 x s16>)
> >   %2:_(<2 x s16>) = G_ASHR %6:_, %4:_(<2 x s16>)
> >   $vgpr0 = COPY %2:_(<2 x s16>)
> > ```
> > Originally: 
> >  Two Artifacts failed to combine and ended up in InstList:
> > ```
> >   ...
> >   %5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)
> >   %4:_(<2 x s16>) = G_BUILD_VECTOR %3:_(s16), %3:_(s16)
> > ```
> >  `%4:_(<2 x s16>) = G_BUILD_VECTOR %3:_(s16), %3:_(s16)` 
> >  was legal, 
> >  We have crash on `%5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)`
> > 	
> > Now:
> > ```
> >   %5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)
> >   %4:_(<2 x s16>) = G_BUILD_VECTOR %3:_(s16), %3:_(s16)
> > ```
> >  are Auxiliary Artifacts and will have to wait for something to combine with them.
> >  Eventually `%11:_(s16), %12:_(s16) = G_UNMERGE_VALUES %4:_(<2 x s16>)`
> >  from legalization of `%2:_(<2 x s16>) = G_ASHR %6:_, %4:_(<2 x s16>)`
> >  will combine with `%4:_(<2 x s16>) = G_BUILD_VECTOR %3:_(s16), %3:_(s16)`
> > 
> >  Combine attempt of `%19:_(s16), %20:_(s16) = G_UNMERGE_VALUES %5:_(<2 x s16>)`
> >  will move `%5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)` to InstList
> >  Once we finish with artifacts, we crash on attempt to legalize Instruction 
> >  `%5:_(<2 x s16>) = G_TRUNC %0:_(<2 x s32>)`
> AMDGPU hasn't marked any G_TRUNC's legal. I think this test is behaving correctly until it adds the minimum set of legal truncates for their target. The minimum set is the same as for G_ANYEXT but with the type indices swapped.
Forgot to mention, this will turn into G_SEXT_INREG and then it's another story.
Anyway, I assume that `<2 x s16> = G_TRUNC <2 x s32>` should be legal and that it is a  subregister class copy 
since `$vgpr0_vgpr1` is `<2 x s32>` and `$vgpr0`  is `<2 x s16>`

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-and.mir:514
     ; CHECK: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s8>) = G_BUILD_VECTOR [[TRUNC2]](s8), [[TRUNC3]](s8), [[TRUNC4]](s8), [[TRUNC5]](s8)
-    ; CHECK: [[UV8:%[0-9]+]]:_(s8), [[UV9:%[0-9]+]]:_(s8), [[UV10:%[0-9]+]]:_(s8), [[UV11:%[0-9]+]]:_(s8) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<4 x s8>)
-    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(s32) = G_ANYEXT [[UV8]](s8)
-    ; CHECK: [[ANYEXT9:%[0-9]+]]:_(s32) = G_ANYEXT [[UV9]](s8)
-    ; CHECK: [[ANYEXT10:%[0-9]+]]:_(s32) = G_ANYEXT [[UV10]](s8)
-    ; CHECK: [[ANYEXT11:%[0-9]+]]:_(s32) = G_ANYEXT [[UV11]](s8)
-    ; CHECK: [[BUILD_VECTOR1:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[ANYEXT8]](s32), [[ANYEXT9]](s32), [[ANYEXT10]](s32), [[ANYEXT11]](s32)
-    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[BUILD_VECTOR1]](<4 x s32>)
+    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(<4 x s32>) = G_ANYEXT [[BUILD_VECTOR]](<4 x s8>)
+    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[ANYEXT8]](<4 x s32>)
----------------
dsanders wrote:
> This change is violating AMDGPU's legalization rules. G_ANYEXT is defined as:
>   getActionDefinitionsBuilder({G_SEXT, G_ZEXT, G_ANYEXT})
>     .legalFor({{S64, S32}, {S32, S16}, {S64, S16},
>                {S32, S1}, {S64, S1}, {S16, S1},
>                // FIXME: Hack
>                {S64, LLT::scalar(33)},
>                {S32, S8}, {S128, S32}, {S128, S64}, {S32, LLT::scalar(24)}})
>     .scalarize(0);
> which is instructing the legalizer to scalarize all vector types.
We crashed before attempting to legalize this G_ANYEXT, once G_TRUNC is resolved  G_ANYEXT will be next.

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-extract.mir:907
     ; CHECK: [[EXTRACT:%[0-9]+]]:_(<5 x s16>) = G_EXTRACT [[TRUNC]](<6 x s16>), 0
-    ; CHECK: [[DEF1:%[0-9]+]]:_(<6 x s32>) = G_IMPLICIT_DEF
-    ; CHECK: [[TRUNC1:%[0-9]+]]:_(<6 x s16>) = G_TRUNC [[DEF1]](<6 x s32>)
-    ; CHECK: [[INSERT:%[0-9]+]]:_(<6 x s16>) = G_INSERT [[TRUNC1]], [[EXTRACT]](<5 x s16>), 0
-    ; CHECK: [[EXTRACT1:%[0-9]+]]:_(<2 x s16>) = G_EXTRACT [[INSERT]](<6 x s16>), 0
+    ; CHECK: [[EXTRACT1:%[0-9]+]]:_(<2 x s16>) = G_EXTRACT [[EXTRACT]](<5 x s16>), 0
     ; CHECK: $vgpr0 = COPY [[EXTRACT1]](<2 x s16>)
----------------
dsanders wrote:
> This change is violating AMDGPU's legalization rules. The relevant one for the G_EXTRACT is:
>         .moreElementsIf(isSmallOddVector(BigTyIdx), oneMoreElement(BigTyIdx))
> which is supposed to add an extra element if the number of elements is odd and the size of the elements is <32. The legalIf that preceeds it only matches if the bit type is a multiple of 32 in size but 5*16 isn't
This is intended, G_EXTRACT waits for it artifact use to be resolved ( [[EXTRACT]] ). Crash happens before we try to legalize G_EXTRACT.
We don't give up on attempt to combine Artifact while there is still chance that corresponding auxiliary artifact would appear once uses of G_EXTRACT are resolved. Notice how we process Machine function, it is InstList first not ArtifactList first. This way we cover a lot of simpler cases but not the cases when artifacts are chained, then we have to wait and try again later.

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-implicit-def.mir:324
     ; CHECK: [[EXTRACT:%[0-9]+]]:_(<3 x s1>) = G_EXTRACT [[TRUNC]](<4 x s1>), 0
-    ; CHECK: [[UV:%[0-9]+]]:_(s1), [[UV1:%[0-9]+]]:_(s1), [[UV2:%[0-9]+]]:_(s1) = G_UNMERGE_VALUES [[EXTRACT]](<3 x s1>)
-    ; CHECK: [[ANYEXT:%[0-9]+]]:_(s32) = G_ANYEXT [[UV]](s1)
-    ; CHECK: [[ANYEXT1:%[0-9]+]]:_(s32) = G_ANYEXT [[UV1]](s1)
-    ; CHECK: [[ANYEXT2:%[0-9]+]]:_(s32) = G_ANYEXT [[UV2]](s1)
-    ; CHECK: [[BUILD_VECTOR:%[0-9]+]]:_(<3 x s32>) = G_BUILD_VECTOR [[ANYEXT]](s32), [[ANYEXT1]](s32), [[ANYEXT2]](s32)
-    ; CHECK: $vgpr0_vgpr1_vgpr2 = COPY [[BUILD_VECTOR]](<3 x s32>)
+    ; CHECK: [[ANYEXT:%[0-9]+]]:_(<3 x s32>) = G_ANYEXT [[EXTRACT]](<3 x s1>)
+    ; CHECK: $vgpr0_vgpr1_vgpr2 = COPY [[ANYEXT]](<3 x s32>)
----------------
dsanders wrote:
> This is another case of violating AMDGPU's G_ANYEXT legalization rules by not scalarizing vector types
We did not try to legalize this yet G_ANYEXT since it has unresolved artifact use, and mir output is state of MF when we crashed on `%2:_(<4 x s1>) = G_TRUNC %3:_(<4 x s32>)` 

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-or.mir:514
     ; CHECK: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s8>) = G_BUILD_VECTOR [[TRUNC2]](s8), [[TRUNC3]](s8), [[TRUNC4]](s8), [[TRUNC5]](s8)
-    ; CHECK: [[UV8:%[0-9]+]]:_(s8), [[UV9:%[0-9]+]]:_(s8), [[UV10:%[0-9]+]]:_(s8), [[UV11:%[0-9]+]]:_(s8) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<4 x s8>)
-    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(s32) = G_ANYEXT [[UV8]](s8)
-    ; CHECK: [[ANYEXT9:%[0-9]+]]:_(s32) = G_ANYEXT [[UV9]](s8)
-    ; CHECK: [[ANYEXT10:%[0-9]+]]:_(s32) = G_ANYEXT [[UV10]](s8)
-    ; CHECK: [[ANYEXT11:%[0-9]+]]:_(s32) = G_ANYEXT [[UV11]](s8)
-    ; CHECK: [[BUILD_VECTOR1:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[ANYEXT8]](s32), [[ANYEXT9]](s32), [[ANYEXT10]](s32), [[ANYEXT11]](s32)
-    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[BUILD_VECTOR1]](<4 x s32>)
+    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(<4 x s32>) = G_ANYEXT [[BUILD_VECTOR]](<4 x s8>)
+    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[ANYEXT8]](<4 x s32>)
----------------
dsanders wrote:
> This is another case of violating AMDGPU's G_ANYEXT legalization rules by not scalarizing vector types
We crashed before attempting to legalize this G_ANYEXT, once G_TRUNC is resolved G_ANYEXT will be next. The trunc in question is  `CHECK: [[TRUNC:%[0-9]+]]:_(<4 x s8>) = G_TRUNC [[DEF]](<4 x s32>)`

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-unmerge-values-xfail.mir:2
 # RUN: not llc -mtriple=amdgcn-- -O0 -run-pass=legalizer -o - %s 2>&1 | FileCheck %s

-# CHECK: LLVM ERROR: unable to legalize instruction: %1:_(s1), %2:_(s1) = G_UNMERGE_VALUES %0:_(<2 x s1>) (in function: test_unmerge_v2s1)
----------------
dsanders wrote:
> Petar.Avramovic wrote:
> > Originally: 
> > We used to crash on:
> >  `%1:_(s1), %2:_(s1) = G_UNMERGE_VALUES %0`
> > 
> > Now: 
> > We noticed that this artifact has use 
> >  `%0:_(<2 x s1>) = G_TRUNC %3:_(<2 x s32>)`, this is where we crash now.
> This one is interesting. AMDGPU's legalization rules say to scalarize this (assuming I'm reading it right, this opcode has very complicated legalization rules), but how do you scalarize G_UNMERGE_VALUES. The normal answer is to emit G_UNMERGE_VALUES but that's not going to work here as we'll just infinitely loop replacing an instruction with the same instruction. I guess G_EXTRACT_VECTOR_ELT is probably the answer here but suppose we want to widenScalar that. We'd probably want to G_UNMERGE_VALUES (bring us back into the loop) and G_BUILD_VECTOR. I'll have to give this one some more thought
Originally this was `widenScalarUnmergeValues` called for vector, and it crashed since vector is not scalar.
>From second rule for G_MERGE/G_UNMERGE     `.clampScalar(LitTyIdx, S16, S256)`.
Don't know what was idea behind this test.

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-unmerge-values.mir:239
 ...

 ---
----------------
dsanders wrote:
> Petar.Avramovic wrote:
> > In test function test_unmerge_s1_s8:
> >  We have a combine for G_MERGE + G_UNMERGE that did not happen before.
> >  Remaining changes are variable name changes by the script, because of combine, 
> >  first merge {MV} and first unmerge{UV} are now missing, and MV1 is now MV, MV2 is MV1 and so on. 
> >  There was no crash here so I removed -global-isel-abort=0
> I believe this is the test where I mentioned to Matt that the lack of G_TRUNC legalization rules caused it to stop processing before the combines could happen. It looks like this patch makes the combines happen a bit earlier and before the G_TRUNC stops further processing
This one finished successfully. G_TRUNC is legal for scalars according to the "legacy rules" that are used when no rules are defined... This is not the case for vectors.
In order to avoid duality of G_TRUNC being both "artifact" and "auxiliary artifact" I suggest to have narrowScalar for G_TRUNC (similar for vectors) so that G_TRUNC could be legalized into G_UNMERGE + COPY and Artifact Combiner would continue with combining. This works with patch like it is.
D62338 complicates things with order in which we attempt to combine. 
Maybe instead for moving G_TRUNC to InstList we move it to ArtifactList first?
No change for remaining auxiliary artifacts since they don't have such combine.

================
Comment at: test/CodeGen/AMDGPU/GlobalISel/legalize-xor.mir:514
     ; CHECK: [[BUILD_VECTOR:%[0-9]+]]:_(<4 x s8>) = G_BUILD_VECTOR [[TRUNC2]](s8), [[TRUNC3]](s8), [[TRUNC4]](s8), [[TRUNC5]](s8)
-    ; CHECK: [[UV8:%[0-9]+]]:_(s8), [[UV9:%[0-9]+]]:_(s8), [[UV10:%[0-9]+]]:_(s8), [[UV11:%[0-9]+]]:_(s8) = G_UNMERGE_VALUES [[BUILD_VECTOR]](<4 x s8>)
-    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(s32) = G_ANYEXT [[UV8]](s8)
-    ; CHECK: [[ANYEXT9:%[0-9]+]]:_(s32) = G_ANYEXT [[UV9]](s8)
-    ; CHECK: [[ANYEXT10:%[0-9]+]]:_(s32) = G_ANYEXT [[UV10]](s8)
-    ; CHECK: [[ANYEXT11:%[0-9]+]]:_(s32) = G_ANYEXT [[UV11]](s8)
-    ; CHECK: [[BUILD_VECTOR1:%[0-9]+]]:_(<4 x s32>) = G_BUILD_VECTOR [[ANYEXT8]](s32), [[ANYEXT9]](s32), [[ANYEXT10]](s32), [[ANYEXT11]](s32)
-    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[BUILD_VECTOR1]](<4 x s32>)
+    ; CHECK: [[ANYEXT8:%[0-9]+]]:_(<4 x s32>) = G_ANYEXT [[BUILD_VECTOR]](<4 x s8>)
+    ; CHECK: $vgpr0_vgpr1_vgpr2_vgpr3 = COPY [[ANYEXT8]](<4 x s32>)
----------------
dsanders wrote:
> This one violates AMDGPU's G_ANYEXT rules by not scalarizing
We crashed before attempting to legalize this G_ANYEXT.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61787/new/

https://reviews.llvm.org/D61787