[PATCH] D61289: [globalisel] Add G_SEXT_INREG
Daniel Sanders via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Jul 3 10:49:10 PDT 2019
dsanders marked an inline comment as done.
dsanders added inline comments.
================
Comment at: llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp:708
+ // don't need to break it apart.
+ if (NarrowTy.getScalarSizeInBits() >= SizeInBits) {
+ Observer.changingInstr(MI);
----------------
Petar.Avramovic wrote:
> dsanders wrote:
> > Petar.Avramovic wrote:
> > > dsanders wrote:
> > > > Petar.Avramovic wrote:
> > > > > dsanders wrote:
> > > > > > Petar.Avramovic wrote:
> > > > > > > What is the idea behind this block? We generate two instructions with types larger than NarrowTy; the G_TRUNC and G_SEXT that are generated have to be narrowScalared, and then we have a few more merge/unmerge combines to do. Also, LegalizerHelper does not know how to narrow scalar G_TRUNC and G_SEXT at the moment.
> > > > > > It's not possible to eliminate types larger than NarrowTy with a single narrowScalar. narrowScalar's job is to replace this instruction with one or more that work on NarrowTy sized types and some legalization artifacts (G_SEXT and G_TRUNC in this case) which are required to maintain type correctness. Those legalization artifacts are either combined away or further legalized according to their legalization rules.
> > > > > >
> > > > > > This block aims to optimize the case where all but one of the narrowed components is wholly a copy of the sign bit. It expects the G_TRUNC and G_SEXT to be either combined away by the artifact combiner or further legalized. For example: narrowing `%0:_(s32) = G_SEXT_INREG %1:_(s32), 8` down to s16 only requires us to preserve the component containing bits 0-15. We can forget about bits 16-31 and reconstruct them with the G_SEXT. This helps chains of instructions get deleted as they become dead.
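For illustration, a minimal sketch of what such an early exit in LegalizerHelper::narrowScalar could look like, using standard MachineIRBuilder calls (an assumption-laden sketch, not the exact code in the patch):

```
  // Sketch: if the sign-extension width fits inside NarrowTy, only the lowest
  // component needs real work. Wrap a narrow G_SEXT_INREG in a G_TRUNC/G_SEXT
  // pair and let the artifact combiner fold those artifacts away.
  if (NarrowTy.getScalarSizeInBits() >= SizeInBits) {
    Register DstReg = MI.getOperand(0).getReg();
    Register SrcReg = MI.getOperand(1).getReg();

    auto Trunc = MIRBuilder.buildTrunc(NarrowTy, SrcReg);
    auto InReg = MIRBuilder.buildInstr(TargetOpcode::G_SEXT_INREG, {NarrowTy},
                                       {Trunc})
                     .addImm(SizeInBits);
    MIRBuilder.buildSExt(DstReg, InReg);

    MI.eraseFromParent();
    return Legalized;
  }
```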
> > > > > I did not consider the situation where G_SEXT(G_SEXT_INREG) could combine away.
> > > > > Currently G_SEXT can only combine into G_TRUNC.
> > > > > The use of this G_SEXT is either something legal or a G_MERGE that is waiting for a G_UNMERGE.
> > > > > We know this since uses are legalized before defs (the G_SEXT is a def in this context).
> > > > > Then again, if the use is legal, then the G_SEXT_INREG was legal, or something else like lower, but not narrow scalar.
> > > > > Are we talking about the artifact combiner, or something that is not in tree?
> > > > > Could you provide a test where the G_SEXT and G_TRUNC produced by this narrowScalar combine away?
> > > > >
> > > > It does look like the artifact combiner is missing some combines. For the `NarrowTy.getScalarSizeInBits() < SizeInBits` case, the resulting:
> > > > %6:_(s128) = G_MERGE_VALUES %0(s64), %12(s64)
> > > > %7:_(s32) = G_TRUNC %6(s128)
> > > > ought to be:
> > > > %7:_(s32) = G_TRUNC %0(s64)
> > > >
> > > > and for the `NarrowTy.getScalarSizeInBits() >= SizeInBits` case there's a:
> > > > %10:_(s64) = G_SEXT_INREG %9, 32
> > > > %6:_(s128) = G_SEXT %10(s64)
> > > > %7:_(s32) = G_TRUNC %6(s128)
> > > > which firstly ought to be:
> > > > %10:_(s64) = G_SEXT_INREG %9, 32
> > > > %7:_(s32) = G_TRUNC %10(s64)
> > > > and secondly:
> > > > %10:_(s64) = G_SEXT_INREG %9, 32
> > > > %7:_(s32) = %9(s32)
> > > > the former of those two isn't really related to this patch (the G_SEXT_INREG isn't involved in the combine) but the latter is.
> > > Sorry, I still don't understand. In this fragment %9 should be s64, not s32. An IR function that results in a situation where it is better to narrowScalar with SEXT+TRUNC, where they combine with something, would be helpful.
> > > Maybe it is possible to perform other combines before we make G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
> > > That is to not create G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT and solve everything in combiner with for example some pattern that combines multiple artifacts instead of only two.
> > > Sorry, I still don't understand. In this fragment %9 should be s64, not s32.
> > You're right, that step isn't quite right. It should be `%7:_(s32) = G_TRUNC %9(s64)`. The point was to consume the input rather than the output as the lower 32-bits are unchanged by the G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.
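As a hypothetical sketch of that fold in artifact-combiner style (the surrounding plumbing and observer notification are assumed; this is not an existing LegalizationArtifactCombiner method):

```
// %dst = G_TRUNC (G_SEXT_INREG %x, K) can read %x directly whenever the
// truncated width is no larger than K, since those low bits are unchanged.
if (MI.getOpcode() == TargetOpcode::G_TRUNC) {
  MachineInstr *Def = MRI.getVRegDef(MI.getOperand(1).getReg());
  if (Def && Def->getOpcode() == TargetOpcode::G_SEXT_INREG) {
    LLT DstTy = MRI.getType(MI.getOperand(0).getReg());
    if (DstTy.getScalarSizeInBits() <= (uint64_t)Def->getOperand(2).getImm())
      MI.getOperand(1).setReg(Def->getOperand(1).getReg());
  }
}
```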
> >
> > > An IR function that results in a situation where it is better to narrowScalar with SEXT+TRUNC, where they combine with something, would be helpful.
> >
> > Any IR where many operations are too large for your target would provide good examples. Consider:
> > %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
> > %3:_(s64) = G_SEXT_INREG %2:_(s64), 16
> > and that both have narrowScalar(/*typeidx=*/0, s32). The legalization would proceed to:
> > %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
> > %4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
> > %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
> > %3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)
> > and then to:
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> > %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
> > %2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
> > %4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
> > %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
> > %3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)
> > then the artifact combiner would fold the middle merge/unmerge to:
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> > %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
> > %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
> > %3:_(s64) = G_MERGE_VALUES %6:_(s32), %8:_(s32)
> > Notice that we still have the `%8:_(s32) = G_ADD %10:_(s32), %12:_(s32)` at this point even though we're about to overwrite it. We're not going to be able to improve on this until a post-legalize combiner.
> >
> > Now consider the same case with this optimization:
> > %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
> > %3:_(s64) = G_SEXT_INREG %2:_(s64), 16
> > becomes:
> > %2:_(s64) = G_ADD %0:_(s64), %1:_(s64)
> > %4:_(s32) = G_TRUNC %2:_(s64)
> > %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
> > %3:_(s64) = G_SEXT %6:_(s32)
> > then:
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> > %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
> > %2:_(s64) = G_MERGE_VALUES %7:_(s32), %8:_(s32)
> > %4:_(s32) = G_TRUNC %2:_(s64)
> > %6:_(s32) = G_SEXT_INREG %4:_(s32), 16
> > %3:_(s64) = G_SEXT %6:_(s32)
> > which the artifact combiner (should) simplify to:
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> > %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %8:_(s32) = G_ADD %10:_(s32), %12:_(s32)
> > %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
> > %3:_(s64) = G_SEXT %6:_(s32)
> > and then to:
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> > %11:_(s32), %12:_(s32) = G_UNMERGE_VALUES %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
> > %3:_(s64) = G_SEXT %6:_(s32)
> > and then to:
> > %9:_(s32) = G_TRUNC %0:_(s64)
> > %11:_(s32) = G_TRUNC %1:_(s64)
> > %7:_(s32) = G_ADD %9:_(s32), %11:_(s32)
> > %6:_(s32) = G_SEXT_INREG %7:_(s32), 16
> > %3:_(s64) = G_SEXT %6:_(s32)
> > which is simpler. The second G_ADD was correctly recognized as being dead and removed. As a result fewer instructions were emitted by the legalizer and subsequent passes have less work to do.
> >
> > > Maybe it is possible to perform other combines before we make G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT?
> > > That is to not create G_SEXT_INREG that would benefit from narrow scalar with G_TRUNC/G_SEXT and solve everything in combiner with for
> > > example some pattern that combines multiple artifacts instead of only two.
> >
> > It's not entirely clear what you're suggesting here. I'm particularly confused by the 'that would benefit from narrow scalar' bit since it's not a choice to narrow scalar or not. An operation is either too wide for the target and must be narrowScalar'd or it's not.
> >
> > If your suggestion is to try to form G_SEXT_INREG instructions in a combine pass after the legalizer, then I'd say that's too late in the pipeline. There's no guarantee that the combine to form a G_SEXT_INREG will run before a combine that makes the pattern unrecognizable. Ideally we want to make a best effort to form them in a pre-legalizer combiner in addition to the legalizer.
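For example, a pre-legalizer combine of that kind might pattern-match the shift pair that G_SEXT_INREG replaces. The sketch below uses the getConstantVRegVal helper from GlobalISel/Utils; the entry point and checks are illustrative assumptions rather than an existing combiner:

```
// Fold %d = G_ASHR (G_SHL %x, C), C into %d = G_SEXT_INREG %x, Width - C.
bool tryFormSExtInReg(MachineInstr &MI, MachineRegisterInfo &MRI,
                      MachineIRBuilder &B) {
  if (MI.getOpcode() != TargetOpcode::G_ASHR)
    return false;
  MachineInstr *Shl = MRI.getVRegDef(MI.getOperand(1).getReg());
  if (!Shl || Shl->getOpcode() != TargetOpcode::G_SHL)
    return false;
  auto OuterAmt = getConstantVRegVal(MI.getOperand(2).getReg(), MRI);
  auto InnerAmt = getConstantVRegVal(Shl->getOperand(2).getReg(), MRI);
  if (!OuterAmt || !InnerAmt || *OuterAmt != *InnerAmt)
    return false;
  LLT Ty = MRI.getType(MI.getOperand(0).getReg());
  // A real combine would also check that the G_SHL has a single use, that the
  // shift amount is in range, and would notify the change observer.
  B.setInstr(MI);
  B.buildInstr(TargetOpcode::G_SEXT_INREG, {MI.getOperand(0).getReg()},
               {Shl->getOperand(1).getReg()})
      .addImm(Ty.getScalarSizeInBits() - *OuterAmt);
  MI.eraseFromParent();
  return true;
}
```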
>
> > The point was to consume the input rather than the output as the lower 32-bits are unchanged by the G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.
> Something like look-through-copy, but here it is looking through the G_SEXT_INREG depending on the immediate (operand 2)?
>
>
> Assuming the artifact combiner is able to perform all the mentioned combines, it gets the same output as narrowScalar using G_UNMERGE/G_MERGE + D61787.
>
> The high bits in
> > `%3:_(s64) = G_MERGE_VALUES %6:_(s32), %5:_(s32)`
> should be the sign bit of %6.
>
> `%3:_(s64) = G_SEXT_INREG %2:_(s64), 16`
> narrowScalars to:
> ```
> %4:_(s32), %5:_(s32) = G_UNMERGE_VALUES %2:_(s64)
> %8:_(s32) = G_CONSTANT i32 31
> %6:_(s32) = G_SEXT_INREG %4:_, 16
> %7:_(s32) = G_ASHR %6:_, %8:_(s32)
> %3:_(s64) = G_MERGE_VALUES %6:_(s32), %7:_(s32)
> ```
> Also, `%3:_(s64) = G_SEXT %6:_(s32)` has to be narrowScalared or combined away by a G_UNMERGE when one is available.
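A rough sketch of that G_UNMERGE/G_MERGE flavour of narrowScalar, for the two-part case shown above and assuming standard MachineIRBuilder helpers (the actual D61787 code may differ):

```
  // Split the source, sign-extend within the low part, then rebuild the high
  // part from the sign bit, matching the MIR sequence quoted above.
  Register DstReg = MI.getOperand(0).getReg();
  Register SrcReg = MI.getOperand(1).getReg();
  int64_t SizeInBits = MI.getOperand(2).getImm();

  auto Unmerge = MIRBuilder.buildUnmerge(NarrowTy, SrcReg);
  auto Lo = MIRBuilder.buildInstr(TargetOpcode::G_SEXT_INREG, {NarrowTy},
                                  {Unmerge.getReg(0)})
                .addImm(SizeInBits);
  auto ShiftAmt =
      MIRBuilder.buildConstant(NarrowTy, NarrowTy.getScalarSizeInBits() - 1);
  auto Hi = MIRBuilder.buildAShr(NarrowTy, Lo, ShiftAmt);
  MIRBuilder.buildMerge(DstReg, {Lo.getReg(0), Hi.getReg(0)});
  MI.eraseFromParent();
```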
>
>
>
> I ran the mentioned example and got the following results:
> The setup for the test is: applying D61289 and D61787, and in MipsLegalizerInfo changing the Mips G_SEXT_INREG rules to:
> ```
> getActionDefinitionsBuilder(G_SEXT_INREG)
>     .legalForTypeWithImm({{s32, 8}, {s32, 16}})
>     .maxScalar(0, s32)
>     .lower();
> ```
> and changing G_SEXT_INREG to always do narrow scalar with G_UNMERGE/G_MERGE.
>
> Running the following test with -march=mipsel -global-isel -O0 -stop-after=legalizer
> ```
> define i64 @f(i64 signext %a, i64 signext %b) {
> entry:
>   %add = add i64 %a, %b
>   %conv = trunc i64 %add to i16
>   %conv1 = sext i16 %conv to i64
>   ret i64 %conv1
> }
> ```
>
> gives the following result:
> ```
> %2:_(s32) = COPY $a0
> %3:_(s32) = COPY $a1
> %4:_(s32) = COPY $a2
> %5:_(s32) = COPY $a3
> %25:_(s32) = G_CONSTANT i32 0
> %22:_(s32) = G_ADD %2, %4
> %26:_(s32) = G_CONSTANT i32 1
> %27:_(s32) = COPY %25(s32)
> %23:_(s32) = G_AND %27, %26
> %16:_(s32) = G_ADD %22, %23
> %24:_(s32) = G_ICMP intpred(ult), %16(s32), %2
> %20:_(s32) = G_ADD %3, %5
> %28:_(s32) = COPY %24(s32)
> %21:_(s32) = G_AND %28, %26
> %18:_(s32) = G_ADD %20, %21
> %32:_(s32) = G_CONSTANT i32 31
> %33:_(s32) = G_SEXT_INREG %16, 16
> %34:_(s32) = G_ASHR %33, %32(s32)
> $v0 = COPY %33(s32)
> $v1 = COPY %34(s32)
> RetRA implicit $v0, implicit $v1
> ```
>
> There are a few places for improvement.
> First, let's remove dead instructions (this could take place after the main do/while loop in Legalizer.cpp):
> ```
> %2:_(s32) = COPY $a0
>
> %4:_(s32) = COPY $a2
>
> %25:_(s32) = G_CONSTANT i32 0
> %22:_(s32) = G_ADD %2, %4
> %26:_(s32) = G_CONSTANT i32 1
> %27:_(s32) = COPY %25(s32)
> %23:_(s32) = G_AND %27, %26
> %16:_(s32) = G_ADD %22, %23
>
>
>
>
>
> %32:_(s32) = G_CONSTANT i32 31
> %33:_(s32) = G_SEXT_INREG %16, 16
> %34:_(s32) = G_ASHR %33, %32(s32)
> $v0 = COPY %33(s32)
> $v1 = COPY %34(s32)
> RetRA implicit $v0, implicit $v1
> ```
> The fragment
> ```
> %25:_(s32) = G_CONSTANT i32 0
> %22:_(s32) = G_ADD %2, %4
> %26:_(s32) = G_CONSTANT i32 1
> %27:_(s32) = COPY %25(s32)
> %23:_(s32) = G_AND %27, %26
> %16:_(s32) = G_ADD %22, %23
> ```
> comes from the lower() of
> ```
> %15:_(s1) = G_CONSTANT i1 false
> %16:_(s32), %17:_(s1) = G_UADDE %11:_, %13:_, %15:_
> ```
> which itself came from the narrowScalar of G_ADD. If we used
> ```
> %16:_(s32), %17:_(s1) = G_UADDO %11:_, %13:_
> ```
> for the low bits, we would get
> ```
> %2:_(s32) = COPY $a0
> %4:_(s32) = COPY $a2
> %16:_(s32) = G_ADD %2, %4
> %32:_(s32) = G_CONSTANT i32 31
> %33:_(s32) = G_SEXT_INREG %16, 16
> %34:_(s32) = G_ASHR %33, %32(s32)
> $v0 = COPY %33(s32)
> $v1 = COPY %34(s32)
> RetRA implicit $v0, implicit $v1
> ```
> This should be equivalent to the mentioned desired output when G_TRUNC/G_SEXT is used in narrowScalar.
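A sketch of that suggestion, with the part-splitting variables (NumParts, Src1Parts, Src2Parts, DstReg) assumed to be set up by the surrounding narrowScalar code:

```
  // Use G_UADDO for the lowest component (there is no incoming carry) and
  // G_UADDE for the rest, so no zero constant or G_AND carry materialization
  // is ever emitted.
  SmallVector<Register, 4> DstParts;
  Register CarryIn;
  for (unsigned i = 0; i != NumParts; ++i) {
    Register DstPart = MRI.createGenericVirtualRegister(NarrowTy);
    Register CarryOut = MRI.createGenericVirtualRegister(LLT::scalar(1));
    if (i == 0)
      MIRBuilder.buildInstr(TargetOpcode::G_UADDO, {DstPart, CarryOut},
                            {Src1Parts[i], Src2Parts[i]});
    else
      MIRBuilder.buildInstr(TargetOpcode::G_UADDE, {DstPart, CarryOut},
                            {Src1Parts[i], Src2Parts[i], CarryIn});
    DstParts.push_back(DstPart);
    CarryIn = CarryOut;
  }
  MIRBuilder.buildMerge(DstReg, DstParts);
```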
>
>
> >It's not entirely clear what you're suggesting here
> ```
> %10:_(s64) = G_SEXT_INREG %9, 32
> %6:_(s128) = G_SEXT %10(s64)
> %7:_(s32) = G_TRUNC %6(s128)
> ```
>
> This was most likely generated from something like:
> ```
> %11:_(s32) = G_TRUNC %9(s64)
> %10:_(s64) = G_SEXT %11:_(s32)
> %6:_(s128) = G_SEXT %10(s64)
> %7:_(s32) = G_TRUNC %6(s128)
> ```
> I meant that we might be able to figure out that these 4 instructions are equivalent to `%7:_(s32) = G_TRUNC %9(s64)` (combine more than 2 artifacts in the artifact combiner) instead of combining the first two into G_SEXT_INREG.
>
> >that would benefit from narrow scalar
> Is there a test that produces better/smaller code when G_SEXT_INREG is narrowScalar'd with G_TRUNC/G_SEXT instead of G_UNMERGE/G_MERGE? From the discussion so far I am convinced that both approaches generate (or should generate) the same output.
> > The point was to consume the input rather than the output as the lower 32-bits are unchanged by the
> > G_SEXT_INREG. Climbing the dependency chain like this allows the uses of %7 to start sooner.
> Something like look-through-copy, but here it is looking through the G_SEXT_INREG depending on the immediate (operand 2)?
I suppose it can be viewed that way, as both end up simplifying the MIR. I think of them as separate things: this is a pattern-match-and-replace, whereas look-through-copy is something that's done to help the pattern-match part by ignoring instructions that aren't relevant to whether the pattern should match.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D61289/new/
https://reviews.llvm.org/D61289