[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?

Evgeny Astigeevich via llvm-dev llvm-dev at lists.llvm.org
Mon Jan 22 10:49:51 PST 2018


Hi Sanjay,

Thank you for your help.

I have looked at one of the Cortex-M regressions David talked about. The patch improves the IR:

====Before:
%conv10 = zext i8 %40 to i32
%xor = xor i32 %conv10, %conv14
%conv16 = trunc i32 %xor to i8
store i8 %conv16, i8* %storemerge185, align 1

====After:
%xor = xor i8 %40, %conv14
store i8 %xor, i8* %storemerge185, align 1

However, it affects register allocation, causing more spills and fills. I am trying to figure out why that happens.

Thanks,
Evgeny Astigeevich

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sanjay Patel via llvm-dev <llvm-dev at lists.llvm.org>
Reply-To: Sanjay Patel <spatel at rotateright.com>
Date: Monday, 22 January 2018 at 18:00
To: David Green <David.Green at arm.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>, nd <nd at arm.com>
Subject: Re: [llvm-dev] always allow canonicalizing to 8- and 16-bit ops?

Thanks for the perf testing. I assume that DAG legalization is equipped to handle these cases fairly well, or someone would've complained by now...
FWIW (and at least some of this can be blamed on me), instcombine already does the narrowing transforms without checking shouldChangeType() for binops like and/or/xor/udiv. The justification was that narrower ops are always better for bit-tracking. If we can eliminate casts, then that improves analysis even more and allows more follow-on transforms.
If there are no objections here, I think you can post your patch for review on Phabricator. If we can squash any of the regressions before that goes in, that would be even better.

On Mon, Jan 22, 2018 at 3:10 AM, David Green <David.Green at arm.com<mailto:David.Green at arm.com>> wrote:
Hello

Thanks for looking into this.

I can't be very confident what the knock-on result of a change like that would be,
especially on architectures that are not Arm. What I can do, though, is run some
benchmarks and look at the results.

Using this patch:

--- a/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned FromWidth,
   bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
   bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);

+  if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))
+    return true;
+
   // If this is a legal integer from type, and the result would be an illegal
   // type, don't do the transformation.
   if (FromLegal && !ToLegal)


Running on a little A core, in the llvm test suite I am seeing these changes:

MultiSource/Benchmarks/BitBench/uudecode/uudecode
        3.38%
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding
        -35.04%
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
        -17.92%
SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant
        -8.57%
External/SPEC/CINT2000/253.perlbmk/253.perlbmk
        -3.43%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm
        -3.36%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl
        -1.34%

+ve for these is bad, -ve is good. So overall it looks like a good change, especially in
simple_types_constant_folding. There may be some alignment issues that are
causing wilder swings than they should, but the results here look good. The list for
aarch64 is roughly the same, just a slightly longer list of minor improvements.

On our internal cortex-m tests we are seeing more regressions but it's still a net
positive in most cases.

I would say that, at least for these results, it looks like a profitable idea. Like I said,
I can't be sure about other architectures though.
Dave

________________________________________
From: Sanjay Patel <spatel at rotateright.com<mailto:spatel at rotateright.com>>
Sent: 17 January 2018 22:50
To: llvm-dev
Cc: David Green
Subject: always allow canonicalizing to 8- and 16-bit ops?
Example:
define i8 @narrow_add(i8 %x, i8 %y) {
  %x32 = zext i8 %x to i32
  %y32 = zext i8 %y to i32
  %add = add nsw i32 %x32, %y32
  %tr = trunc i32 %add to i8
  ret i8 %tr
}

With no data-layout or with an x86 target where 8-bit integer is in the data-layout, we reduce to:

$ ./opt -instcombine narrowadd.ll -S
define i8 @narrow_add(i8 %x, i8 %y) {
  %add = add i8 %x, %y
  ret i8 %add
}

But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType().

Should we make an exception to allow narrowing for the common cases of i8 and i16?

In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions:

declare void @use4(i8, i8, i8, i8)
define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
  %nx = xor i8 %x, -1
  %ny = xor i8 %y, -1
  %nz = xor i8 %z, -1
  %zx = zext i8 %nx to i32
  %zy = zext i8 %ny to i32
  %zz = zext i8 %nz to i32

  %cmpxz = icmp ult i32 %zx, %zz
  %minxz = select i1 %cmpxz, i32 %zx, i32 %zz
  %cmpyz = icmp ult i32 %zy, %zz
  %minyz = select i1 %cmpyz, i32 %zy, i32 %zz
  %cmpyx = icmp ult i8 %y, %x
  %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz
  %tr_minxyz = trunc i32 %minxyz to i8

  %new_zx = sub nsw i32 %zx, %minxyz
  %new_zy = sub nsw i32 %zy, %minxyz
  %new_zz = sub nsw i32 %zz, %minxyz
  %new_x = trunc i32 %new_zx to i8
  %new_y = trunc i32 %new_zy to i8
  %new_z = trunc i32 %new_zz to i8

  call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
  ret void
}

...but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse:

define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
  %nx = xor i8 %x, -1
  %ny = xor i8 %y, -1
  %nz = xor i8 %z, -1
  %cmpxz = icmp ult i8 %nx, %nz
  %minxz = select i1 %cmpxz, i8 %nx, i8 %nz
  %1 = icmp ult i8 %minxz, %ny
  %minxyz = select i1 %1, i8 %minxz, i8 %ny
  %new_x = sub i8 %nx, %minxyz
  %new_y = sub i8 %ny, %minxyz
  %new_z = sub i8 %nz, %minxyz

  call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
  ret void
}
