[PATCH] D61289: [globalisel] Add G_SEXT_INREG

Daniel Sanders via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue May 21 18:46:16 PDT 2019


dsanders marked 3 inline comments as done.
dsanders added a comment.

Sorry for the slow reply. I had to work on other things for a couple of days.



================
Comment at: llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp:698
     return Legalized;
+  case TargetOpcode::G_SEXT_INREG: {
+    if (TypeIdx != 0)
----------------
Petar.Avramovic wrote:
> dsanders wrote:
> > Petar.Avramovic wrote:
> > > dsanders wrote:
> > > > Petar.Avramovic wrote:
> > > > > NarrowScalar is a good candidate for a separate patch. The reason is that the artifact combiner currently cannot handle a test that would use G_SEXT_INREG, since it will have chained artifacts. The G_SEXT will be G_UNMERGE-d, and the G_UNMERGE has to wait for:
> > > > >  - the G_SEXT to be combined into G_SEXT_INREG,
> > > > >  - and then for the G_SEXT_INREG to be narrowScalar'd into a sequence of instructions that ends with G_MERGE_VALUES.
> > > > > Only at that point can the G_UNMERGE be combined.
> > > > > 
> > > > > 
> > > > There's a test for narrowScalar of G_SEXT_INREG in this patch. Could you elaborate on what kind of test you're looking for? I think you're looking for a test with neighbouring instructions that are also narrowScalar'd.
> > > Yes. For example, with this test on aarch64, the store is narrowScalar'd:
> > > 
> > > ```
> > >   define void @func(i8 %x, i128* %p) {
> > >   entry:
> > >     %conv = sext i8 %x to i128
> > >     store i128 %conv, i128* %p
> > >     ret void
> > >   }
> > > ```
> > > The legalizer is not able to produce something like this:
> > > ```
> > >     %2:_(s32) = COPY $w0
> > >     %1:_(p0) = COPY $x1
> > >     %11:_(s64) = G_CONSTANT i64 63
> > >     %12:_(s64) = G_SEXT_INREG %2, 8
> > >     %13:_(s64) = G_ASHR %12, %11(s64)
> > >     G_STORE %12(s64), %1(p0) :: (store 8 into %ir.p, align 16)
> > >     %7:_(s64) = G_CONSTANT i64 8
> > >     %6:_(p0) = G_GEP %1, %7(s64)
> > >     G_STORE %13(s64), %6(p0) :: (store 8 into %ir.p + 8, align 16)
> > >     RET_ReallyLR
> > > ```
> > > which I assume is the desired output.
> > Ah ok. The issue there is that G_STORE hasn't implemented narrowScalar yet. I don't think that we should prevent any opcode from getting a narrowScalar implementation until all opcodes have one. I think we should be implementing them as we need them and gradually build up the full set over time.
> The problem (if we use narrow scalar with G_MERGE/G_UNMERGE) is the order in which we attempt to combine artifacts in the legalizer. D61787 and a narrow scalar for G_ANYEXT (with G_MERGE/G_UNMERGE) will be able to give output like this.
I don't understand this. What is the issue with the order artifacts are combined?


================
Comment at: llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp:708
+    // don't need to break it apart.
+    if (NarrowTy.getScalarSizeInBits() >= SizeInBits) {
+      Observer.changingInstr(MI);
----------------
Petar.Avramovic wrote:
> dsanders wrote:
> > Petar.Avramovic wrote:
> > > dsanders wrote:
> > > > Petar.Avramovic wrote:
> > > > > What is the idea behind this block? We generate two instructions with types larger than NarrowTy. The G_TRUNC and G_SEXT that are generated have to be narrowScalar'd, and then we have a few more merge/unmerge combines to do. Also, LegalizerHelper does not know how to narrow scalar G_TRUNC and G_SEXT at the moment.
> > > > It's not possible to eliminate types larger than NarrowTy with a single narrowScalar. narrowScalar's job is to replace this instruction with one or more that work on NarrowTy sized types and some legalization artifacts (G_SEXT and G_TRUNC in this case) which are required to maintain type correctness. Those legalization artifacts are either combined away or further legalized according to their legalization rules.
> > > > 
> > > > This block aims to optimize the case where all but one of the narrowed components is wholly a copy of the sign bit. It expects the G_TRUNC and G_SEXT to be either combined away by the artifact combiner or further legalized. For example: narrowing `%0:_(s32) = G_SEXT_INREG %1:_(s32), 8` down to s16 only requires us to preserve the component containing bits 0-15. We can forget about bits 16-31 and reconstruct them with the G_SEXT. This helps chains of dead instructions to be deleted.
> > > I did not consider the situation where the G_SEXT(G_SEXT_INREG) could combine away.
> > > Currently G_SEXT can only combine into G_TRUNC.
> > > The use of this G_SEXT is either something legal or a G_MERGE that is waiting for a G_UNMERGE.
> > > We know this since uses are legalized before defs (the G_SEXT is a def in this context).
> > > Then again, if the use is legal, then the G_SEXT_INREG was legal or was handled some other way such as lower, but not narrow scalar.
> > > Are we talking about the artifact combiner, or something that is not in tree?
> > > Could you provide a test where the G_SEXT and G_TRUNC produced by this narrowScalar combine away?
> > > 
> > It does look like the artifact combiner is missing some combines. For the `NarrowTy.getScalarSizeInBits() < SizeInBits` case, the resulting:
> >     %6:_(s128) = G_MERGE_VALUES %0(s64), %12(s64)
> >     %7:_(s32) = G_TRUNC %6(s128)
> > ought to be:
> >     %7:_(s32) = G_TRUNC %0(s64)
> > 
> > and for the `NarrowTy.getScalarSizeInBits() >= SizeInBits` case there's a:
> >     %10:_(s64) = G_SEXT_INREG %9, 32
> >     %6:_(s128) = G_SEXT %10(s64)
> >     %7:_(s32) = G_TRUNC %6(s128)
> > which firstly ought to be:
> >     %10:_(s64) = G_SEXT_INREG %9, 32
> >     %7:_(s32) = G_TRUNC %10(s64)
> > and secondly:
> >     %10:_(s64) = G_SEXT_INREG %9, 32
> >     %7:_(s32) = %9(s32)
> > the former of those two isn't really related to this patch (the G_SEXT_INREG isn't involved in the combine) but the latter is.
> Sorry, I still don't understand. In this fragment %9 should be s64, not s32. An IR function that results in a situation where it is better to narrow scalar with SEXT+TRUNC, and where they combine with something, would be helpful.
> Maybe it is possible to perform other combines before we create a G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
> That is, to not create a G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT, and instead solve everything in the combiner with, for example, some pattern that combines multiple artifacts instead of only two.
> Sorry, I still don't understand. In this fragment %9 should be s64, not s32.
You're right, that step isn't quite correct. It should be `%7:_(s32) = G_TRUNC %9(s64)`. The point was to consume the input rather than the output, since the lower 32 bits are unchanged by the G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.

> An IR function that results in a situation where it is better to narrow scalar with SEXT+TRUNC, and where they combine with something, would be helpful.

Any IR where many operations are too large for your target would provide good examples. Consider:
  %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
  %3:_(s64) = G_SEXT_INREG %2:_(s64), 16
and suppose both have narrowScalar(/*typeidx=*/0, s32). The legalization would proceed to:
  %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
  %4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
  %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
  %3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)
and then to:
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
  %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
  %2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
  %4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
  %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
  %3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)
then the artifact combiner would fold the middle merge/unmerge to:
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
  %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
  %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
  %3:_(s64) = G_MERGE_VALUES %6:_(s32), %8:_(s32)
Notice that we still have the `%8:_(s32) = G_ADD %10:_(s32), %12:_(s32)` at this point even though the value it computes is about to be overwritten by copies of the sign bit. We're not going to be able to improve on this until a post-legalize combiner runs.
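
For reference, here's a minimal sketch of the G_UNMERGE/G_MERGE-based per-part expansion being discussed, including the sign-broadcast G_ASHR from the earlier aarch64 dump. It leans on MachineIRBuilder helpers (buildUnmerge, buildSExtInReg, buildAShr, buildMerge); the function name and structure are illustrative, not the code in this patch:
```
// Illustrative sketch, not the committed implementation. Part 0 keeps the
// (narrowed) G_SEXT_INREG; every higher part is a broadcast of its sign
// bit. Assumes NarrowTy evenly divides the source type and that
// SizeInBits <= NarrowTy.getSizeInBits().
LegalizerHelper::LegalizeResult
narrowScalarSextInRegViaMerge(MachineInstr &MI, LLT NarrowTy,
                              MachineIRBuilder &MIRBuilder) {
  Register Dst = MI.getOperand(0).getReg();
  Register Src = MI.getOperand(1).getReg();
  int64_t SizeInBits = MI.getOperand(2).getImm();

  // Split the wide source into NarrowTy-sized pieces.
  auto Unmerge = MIRBuilder.buildUnmerge(NarrowTy, Src);
  unsigned NumParts = Unmerge->getNumOperands() - 1;

  // Sign-extend within the lowest piece only.
  auto Low = MIRBuilder.buildSExtInReg(NarrowTy, Unmerge.getReg(0), SizeInBits);

  // A single shared G_ASHR reproduces the sign bits for the higher pieces.
  auto ShAmt = MIRBuilder.buildConstant(NarrowTy, NarrowTy.getSizeInBits() - 1);
  auto Sign = MIRBuilder.buildAShr(NarrowTy, Low, ShAmt);

  SmallVector<Register, 4> Parts(1, Low.getReg(0));
  for (unsigned I = 1; I != NumParts; ++I)
    Parts.push_back(Sign.getReg(0));
  MIRBuilder.buildMerge(Dst, Parts);
  MI.eraseFromParent();
  return LegalizerHelper::Legalized;
}
```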

Now consider the same case with this optimization:
  %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
  %3:_(s64) = G_SEXT_INREG %2:_(s64), 16
becomes:
  %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
  %4:_(s32) = G_TRUNC %2:_(s64)
  %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
  %3:_(s64) = G_SEXT %6:_(s32)
then:
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
  %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
  %2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
  %4:_(s32) = G_TRUNC %2:_(s64)
  %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
  %3:_(s64) = G_SEXT %6:_(s32)
which the artifact combiner (should) simplify to:
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
  %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
  %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
  %3:_(s64) = G_SEXT %6:_(s32)
and then to:
  %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
  %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
  %3:_(s64) = G_SEXT %6:_(s32)
and then to:
  %9:_(s32) = G_TRUNC %0:_(s64)
  %11:_(s32) = G_TRUNC %1:_(s64)
  %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
  %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
  %3:_(s64) = G_SEXT %6:_(s32)
which is simpler. The second G_ADD was correctly recognized as being dead and removed. As a result, fewer instructions were emitted by the legalizer, and subsequent passes have less work to do.
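
For concreteness, here's roughly what that optimized path could look like inside the G_SEXT_INREG case of narrowScalar; a sketch under the assumption that the usual MachineIRBuilder helpers are available, not the patch verbatim:
```
int64_t SizeInBits = MI.getOperand(2).getImm();
// When one narrow piece still covers all SizeInBits, do the G_SEXT_INREG
// on that single piece and let the G_TRUNC/G_SEXT artifacts either
// combine away (as in the second trace above) or legalize further.
if (SizeInBits <= NarrowTy.getScalarSizeInBits()) {
  auto Trunc = MIRBuilder.buildTrunc(NarrowTy, MI.getOperand(1).getReg());
  auto Ext = MIRBuilder.buildSExtInReg(NarrowTy, Trunc, SizeInBits);
  MIRBuilder.buildSExt(MI.getOperand(0).getReg(), Ext);
  MI.eraseFromParent();
  return Legalized;
}
```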

> Maybe it is possible to perform other combines before we create a G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
> That is, to not create a G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT, and instead solve everything in the combiner
> with, for example, some pattern that combines multiple artifacts instead of only two.

It's not entirely clear what you're suggesting here. I'm particularly confused by the 'that would benefit from narrow scalar' part, since whether to narrow scalar isn't a choice: an operation is either too wide for the target and must be narrowScalar'd, or it isn't.

If your suggestion is to try to form G_SEXT_INREG instructions in a combine pass after the legalizer, then I'd say that's too late in the pipeline. There's no guarantee that the combine to form a G_SEXT_INREG will run before a combine that makes them unrecognizable. Ideally, we want to make a best effort to form them in a pre-legalizer combiner in addition to the legalizer.
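
A rough sketch of such a pre-legalizer combine, matching the common (G_ASHR (G_SHL %x, K), K) idiom; `getShiftConst` is a stand-in for whatever constant-operand lookup the combiner provides (e.g. a getConstantVRegVal-style helper), and none of this is in-tree code:
```
// Sketch: fold (G_ASHR (G_SHL %x, K), K) on sN into G_SEXT_INREG %x, N-K.
static bool tryFormSextInReg(MachineInstr &MI, MachineRegisterInfo &MRI,
                             MachineIRBuilder &B) {
  if (MI.getOpcode() != TargetOpcode::G_ASHR)
    return false;
  MachineInstr *Shl = MRI.getVRegDef(MI.getOperand(1).getReg());
  if (!Shl || Shl->getOpcode() != TargetOpcode::G_SHL)
    return false;
  // Both shift amounts must be the same constant K.
  auto K1 = getShiftConst(MI.getOperand(2).getReg(), MRI);
  auto K2 = getShiftConst(Shl->getOperand(2).getReg(), MRI);
  if (!K1 || !K2 || *K1 != *K2)
    return false;
  LLT Ty = MRI.getType(MI.getOperand(0).getReg());
  int64_t Width = Ty.getScalarSizeInBits() - *K1;
  if (Width <= 0)
    return false;
  // The shl/ashr pair is a sign extension from bit N-K; emit it directly.
  B.setInstrAndDebugLoc(MI);
  B.buildSExtInReg(MI.getOperand(0).getReg(), Shl->getOperand(1).getReg(),
                   Width);
  MI.eraseFromParent();
  return true;
}
```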


================
Comment at: llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp:756
+
+    // Do the operation on each small part.
+    for (int i = 0; i < NumParts; ++i) {
----------------
Petar.Avramovic wrote:
> dsanders wrote:
> > Petar.Avramovic wrote:
> > > dsanders wrote:
> > > > Petar.Avramovic wrote:
> > > > > This loop works for `NarrowTy.getScalarSizeInBits() >= SizeInBits` as well.
> > > > > `NarrowTy.getScalarSizeInBits()` -> `NarrowSize`
> > > > > 
> > > > It does, but it potentially generates more instructions to do it. Consider `%0:_(s32) = G_SEXT_INREG %1:_(s32), 4` narrowScalar'd to s8. This method will break it into four components and will emit one G_SEXT_INREG and one G_ASHR. The other method will emit one G_SEXT_INREG and one G_SEXT, which may be eliminated entirely or, at worst, lower to one G_ASHR.
> > > Could we then add a possibility for targets to choose how they want to narrow scalar, that is, to have:
> > > narrowScalar - always creates G_UNMERGE+...+G_MERGE
> > > narrowScalarExt - creates G_TRUNC+...+G_{S|Z|ANY}EXT
> > Assuming we fix the missing artifact combines, what would be the case where G_UNMERGE/G_MERGE is the better option?
> %0:_(s128) = G_ANYEXT .....
> %1:_(s128) = G_SEXT_INREG %0, 8
> %2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)
> 
> At the moment, most (if not all?) narrow scalars produce a sequence of instructions that starts with G_UNMERGE and ends with G_MERGE. If we emit a G_SEXT at the end of the sequence, it won't be able to combine with the G_UNMERGE_VALUES. We then have to narrow scalar this G_SEXT (which will end with a G_MERGE) and then combine that with `%2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)`.
> The same goes for the G_TRUNC at the start. Not to mention that this complicates the order in which we attempt to combine artifacts.
> And if a G_SEXT_INREG is to be narrow scalar'd, I would expect it to be surrounded by G_MERGE and G_UNMERGE, on mips at least.
> Considering the llvm test-suite and mips, I would say that it is always better to emit G_UNMERGE/G_MERGE. I cannot think of an example where the G_SEXT would be able to combine with something.
> As a general answer, I would say that it is better to emit G_UNMERGE/G_MERGE if the def of the G_SEXT_INREG is used by a G_UNMERGE_VALUES. This is problematic since it requires the legalizer to decide how to narrow scalar based on the surrounding instructions.
> 
> 
> 
You're assuming that we _don't_ fix the missing artifact combines there. Given:
  %0:_(s128) = G_ANYEXT .....
  %1:_(s128) = G_SEXT_INREG %0, 8
  %2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %1:_(s128)
  ... = ... %2:_(s64)
  ... = ... %3:_(s64)
then assuming you are doing narrowScalar(/*typeidx=*/0, s64) on the G_SEXT_INREG, we should get:
  %0:_(s128) = G_ANYEXT .....
  %5:_(s64) = G_TRUNC %0:_(s128)
  %1:_(s64) = G_SEXT_INREG %5, 8
  %4:_(s128) = G_SEXT %1
  %2:_(s64), %3:_(s64) = G_UNMERGE_VALUES %4:_(s128)
  ... = ... %2:_(s64)
  ... = ... %3:_(s64)
then the artifact combiner should give:
  %0:_(s128) = G_ANYEXT .....
  %5:_(s64) = G_TRUNC %0:_(s128)
  %1:_(s64) = G_SEXT_INREG %5, 8
  %6:_(s64) = G_ASHR %1, i64 63
  ... = ... %1:_(s64)
  ... = ... %6:_(s64)
then if we assume that G_ANYEXT was from anything smaller than s64 (which should be the case for MIPS):
  %5:_(s64) = G_ANYEXT .....
  %1:_(s64) = G_SEXT_INREG %5, 8
  %6:_(s64) = G_ASHR %1, i64 63
  ... = ... %1:_(s64)
  ... = ... %6:_(s64)
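
The last step above is a G_TRUNC(G_ANYEXT) fold. A minimal sketch of that artifact combine, assuming the truncated width is at least the width of the original G_ANYEXT input (names illustrative, not necessarily how the in-tree combiner structures it):
```
// Sketch: fold %t:sN = G_TRUNC (%a:sM = G_ANYEXT %x:sK) when K <= N < M.
// The truncation only discards bits the G_ANYEXT left undefined, so the
// pair reduces to a single G_ANYEXT (or a plain replacement when K == N).
static bool combineTruncAnyExt(MachineInstr &Trunc, MachineRegisterInfo &MRI,
                               MachineIRBuilder &B) {
  MachineInstr *Ext = MRI.getVRegDef(Trunc.getOperand(1).getReg());
  if (!Ext || Ext->getOpcode() != TargetOpcode::G_ANYEXT)
    return false;
  Register Dst = Trunc.getOperand(0).getReg();
  Register X = Ext->getOperand(1).getReg();
  unsigned DstSize = MRI.getType(Dst).getSizeInBits();
  unsigned XSize = MRI.getType(X).getSizeInBits();
  if (XSize > DstSize)
    return false; // Would still need a real truncation; not handled here.
  B.setInstrAndDebugLoc(Trunc);
  if (XSize == DstSize)
    MRI.replaceRegWith(Dst, X);
  else
    B.buildAnyExt(Dst, X);
  Trunc.eraseFromParent();
  return true;
}
```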

> I cannot think of an example where the G_SEXT would be able to combine with something.

There are plenty of cases where that should be able to combine. The common cases, for targets that expect the input MIR to contain s8, s16, s32, and s64 operations and need to legalize to only s32 and s64 operations, are:
If X == 2Y:
  %1:_(sX) = G_SEXT %0:_(sY)
  %2:_(sY), %3:_(sY) = G_UNMERGE_VALUES %1
  ... = ... %2:_(sY)
  ... = ... %3:_(sY)
to
  %3:_(sY) = G_ASHR %0:_(sY), iY (Y-1)
  ... = ... %0:_(sY)
  ... = ... %3:_(sY)

If X == 4Y:
  %1:_(sX) = G_SEXT %0:_(sY)
  %2:_(sY), %3:_(sY), %4:_(sY), %5:_(sY) = G_UNMERGE_VALUES %1
  ... = ... %2:_(sY)
  ... = ... %3:_(sY)
  ... = ... %4:_(sY)
  ... = ... %5:_(sY)
to
  %3:_(sY) = G_ASHR %0:_(sY), iY (Y-1)
  ... = ... %0:_(sY)
  ... = ... %3:_(sY)
  ... = ... %3:_(sY)
  ... = ... %3:_(sY)

For G_ZEXT, replace the `G_ASHR` with `G_CONSTANT iY 0`. For G_ANYEXT, use `G_IMPLICIT_DEF` instead.
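
A sketch of the X == 2Y / X == 4Y combine above, for the case where the unmerged part type matches the G_SEXT input type; names and structure are illustrative rather than in-tree code:
```
// Sketch: fold G_UNMERGE_VALUES(G_SEXT %x) when the parts have the same
// type as %x. Part 0 becomes %x itself; all higher parts become a single
// shared G_ASHR that broadcasts the sign bit.
static bool combineUnmergeSExt(MachineInstr &Unmerge, MachineRegisterInfo &MRI,
                               MachineIRBuilder &B) {
  // The unmerged source is the last operand.
  Register Src = Unmerge.getOperand(Unmerge.getNumOperands() - 1).getReg();
  MachineInstr *SExt = MRI.getVRegDef(Src);
  if (!SExt || SExt->getOpcode() != TargetOpcode::G_SEXT)
    return false;
  Register In = SExt->getOperand(1).getReg();
  LLT PartTy = MRI.getType(Unmerge.getOperand(0).getReg());
  if (MRI.getType(In) != PartTy)
    return false; // Only the case where parts match the G_SEXT input type.
  B.setInstrAndDebugLoc(Unmerge);
  auto ShAmt = B.buildConstant(PartTy, PartTy.getScalarSizeInBits() - 1);
  auto Sign = B.buildAShr(PartTy, In, ShAmt);
  // Rewrite part 0 to the input and every other part to the sign broadcast.
  MRI.replaceRegWith(Unmerge.getOperand(0).getReg(), In);
  for (unsigned I = 1, E = Unmerge.getNumOperands() - 1; I != E; ++I)
    MRI.replaceRegWith(Unmerge.getOperand(I).getReg(), Sign.getReg(0));
  Unmerge.eraseFromParent();
  return true;
}
```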


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61289/new/

https://reviews.llvm.org/D61289




