[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?

Mon Jan 22 10:00:15 PST 2018

Thanks for the perf testing. I assume that DAG legalization is equipped to
handle these cases fairly well, or someone would've complained by now...

FWIW (and at least some of this can be blamed on me), instcombine already
does the narrowing transforms without checking shouldChangeType() for
binops like and/or/xor/udiv. The justification was that narrower ops are
always better for bit-tracking. If we can eliminate casts, then that
improves analysis even more and allows more follow-on transforms.

If there are no objections here, I think you can post your patch for review
on Phabricator. If we can squash any of the regressions before that goes
in, that would be even better.

On Mon, Jan 22, 2018 at 3:10 AM, David Green <David.Green at arm.com> wrote:

> Hello
>
> Thanks for looking into this.
>
> I can't be very confident what the knock on result of a change like that
> would be,
> especially on architectures that are not Arm. What I can do though, is run
> some
> benchmarks and look at that results.
>
> Using this patch:
>
> --- a/lib/Transforms/InstCombine/InstructionCombining.cpp
> +++ b/lib/Transforms/InstCombine/InstructionCombining.cpp
> @@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned
> FromWidth,
>    bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
>    bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);
>
> +  if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))
> +    return true;
> +
>    // If this is a legal integer from type, and the result would be an
> illegal
>    // type, don't do the transformation.
>    if (FromLegal && !ToLegal)
>
>
> Running on a little A core, in the llvm test suite I am seeing these
> changes:
>
> MultiSource/Benchmarks/BitBench/uudecode/uudecode
>         3.38%
> SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding
>         -35.04%
> MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
>         -17.92%
> SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant
>         -8.57%
> External/SPEC/CINT2000/253.perlbmk/253.perlbmk
>         -3.43%
> MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm
>         -3.36%
> MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl
>         -1.34%
>
> +ve for these is bad, -ve is good. So overall looks like a good change,
> especially in
> simple_types_constant_folding. There may be some alignment issues that can
> causing wilder swings than they should, but the results here look good.
> The list for
> aarch64 is roughly the same, just a slightly longer list of minor
> improvements.
>
> On our internal cortex-m tests we are seeing more regressions but it's
> still a net
> positive in most cases.
>
> I would say that at least for these results, it looks like a profitable
> idea. Like I said
> I can't be sure about other architectures though.
> Dave
>
> ________________________________________
> From: Sanjay Patel <spatel at rotateright.com>
> Sent: 17 January 2018 22:50
> To: llvm-dev
> Cc: David Green
> Subject: always allow canonicalizing to 8- and 16-bit ops?
>
> Example:
> define i8 @narrow_add(i8 %x, i8 %y) {
>   %x32 = zext i8 %x to i32
>   %y32 = zext i8 %y to i32
>   %add = add nsw i32 %x32, %y32
>   %tr = trunc i32 %add to i8
>   ret i8 %tr
> }
>
> With no data-layout or with an x86 target where 8-bit integer is in the
> data-layout, we reduce to:
>
> $ ./opt -instcombine narrowadd.ll -S
> define i8 @narrow_add(i8 %x, i8 %y) {
>   %add = add i8 %x, %y
>   ret i8 %add
> }
>
> But on a target that has 32-bit registers without explicit subregister
> ops, we don't do that transform because we avoid changing operations from a
> legal (as specified in the data-layout) width to an illegal width - see
> InstCombiner::shouldChangeType().
>
> Should we make an exception to allow narrowing for the common cases of i8
> and i16?
>
> In the motivating example from PR35875 ( https://bugs.llvm.org/show_
> bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions:
>
> declare void @use4(i8, i8, i8, i8)
> define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
>   %nx = xor i8 %x, -1
>   %ny = xor i8 %y, -1
>   %nz = xor i8 %z, -1
>   %zx = zext i8 %nx to i32
>   %zy = zext i8 %ny to i32
>   %zz = zext i8 %nz to i32
>
>   %cmpxz = icmp ult i32 %zx, %zz
>   %minxz = select i1 %cmpxz, i32 %zx, i32 %zz
>   %cmpyz = icmp ult i32 %zy, %zz
>   %minyz = select i1 %cmpyz, i32 %zy, i32 %zz
>   %cmpyx = icmp ult i8 %y, %x
>   %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz
>   %tr_minxyz = trunc i32 %minxyz to i8
>
>   %new_zx = sub nsw i32 %zx, %minxyz
>   %new_zy = sub nsw i32 %zy, %minxyz
>   %new_zz = sub nsw i32 %zz, %minxyz
>   %new_x = trunc i32 %new_zx to i8
>   %new_y = trunc i32 %new_zy to i8
>   %new_z = trunc i32 %new_zz to i8
>
>   call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
>   ret void
> }
>
> ...but x86 gets to shrink the subs which leads to a bunch of other
> transforms, and we grind this down to 10 instructions between instcombine
> and early-cse:
>
> define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
>   %nx = xor i8 %x, -1
>   %ny = xor i8 %y, -1
>   %nz = xor i8 %z, -1
>   %cmpxz = icmp ult i8 %nx, %nz
>   %minxz = select i1 %cmpxz, i8 %nx, i8 %nz
>   %1 = icmp ult i8 %minxz, %ny
>   %minxyz = select i1 %1, i8 %minxz, i8 %ny
>   %new_x = sub i8 %nx, %minxyz
>   %new_y = sub i8 %ny, %minxyz
>   %new_z = sub i8 %nz, %minxyz
>
>   call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
>   ret void
> }
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180122/622c0ac9/attachment.html>