[llvm-dev] always allow canonicalizing to 8- and 16-bit ops?

Mon Jan 22 02:10:04 PST 2018

Hello

Thanks for looking into this.

I can't be very confident about what the knock-on effects of a change like that would be,
especially on architectures other than Arm. What I can do, though, is run some
benchmarks and look at the results.

Using this patch:

--- a/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned FromWidth,
   bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
   bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);
 
+  if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))
+    return true;
+
   // If this is a legal integer from type, and the result would be an illegal
   // type, don't do the transformation.
   if (FromLegal && !ToLegal)
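
To make the effect of the extra check easier to see outside of LLVM, here is a
rough standalone model of the hunk above. It is only a sketch: ToyDataLayout,
onlyI32Legal, x86Like, shouldChangeTypeSketch and the AllowNarrow8And16 flag are
names made up for illustration, and everything past the last line visible in
the diff is assumed rather than copied from InstCombine.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for llvm::DataLayout: it only records which integer
// widths the target treats as legal.
struct ToyDataLayout {
  std::set<unsigned> LegalWidths;
  bool isLegalInteger(unsigned Width) const {
    return LegalWidths.count(Width) != 0;
  }
};

// Hypothetical layouts: one where only i32 is legal, and one x86-like layout.
ToyDataLayout onlyI32Legal() { return ToyDataLayout{{32}}; }
ToyDataLayout x86Like() { return ToyDataLayout{{8, 16, 32, 64}}; }

// Models the checks visible in the diff. AllowNarrow8And16 switches the
// patched early return on or off so both behaviours can be compared.
bool shouldChangeTypeSketch(const ToyDataLayout &DL, unsigned FromWidth,
                            unsigned ToWidth, bool AllowNarrow8And16) {
  assert(FromWidth > 0 && ToWidth > 0 && "integer widths must be non-zero");

  bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
  bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);

  // The patched check: always allow narrowing a legal type to i8 or i16.
  if (AllowNarrow8And16 && FromLegal && ToWidth < FromWidth &&
      (ToWidth == 8 || ToWidth == 16))
    return true;

  // Legal source type, illegal destination type: refuse the transform.
  if (FromLegal && !ToLegal)
    return false;

  // Assumption: the rest of the real function is not shown in the diff, so
  // this sketch simply accepts every remaining case.
  return true;
}
```

With the only-i32 layout, shouldChangeTypeSketch(onlyI32Legal(), 32, 8, false)
is false and the same query with the flag set is true, while a 32 to 24 change
is still refused either way.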


Running on a little A core, in the llvm test suite I am seeing these changes:

MultiSource/Benchmarks/BitBench/uudecode/uudecode                            3.38%
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding            -35.04%
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1                            -17.92%
SingleSource/Benchmarks/Adobe-C++/simple_types_loop_invariant               -8.57%
External/SPEC/CINT2000/253.perlbmk/253.perlbmk                              -3.43%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm                    -3.36%
MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl   -1.34%

+ve for these is bad, -ve is good. So overall it looks like a good change, especially in
simple_types_constant_folding. There may be some alignment issues that can
cause wilder swings than they should, but the results here look good. The list for
AArch64 is roughly the same, just with a slightly longer list of minor improvements.

On our internal Cortex-M tests we are seeing more regressions, but it's still a net
positive in most cases.

I would say that, at least for these results, it looks like a profitable idea. Like I said,
I can't be sure about other architectures though.
Dave

________________________________________
From: Sanjay Patel <spatel at rotateright.com>
Sent: 17 January 2018 22:50
To: llvm-dev
Cc: David Green
Subject: always allow canonicalizing to 8- and 16-bit ops?

Example:
define i8 @narrow_add(i8 %x, i8 %y) {
  %x32 = zext i8 %x to i32
  %y32 = zext i8 %y to i32
  %add = add nsw i32 %x32, %y32
  %tr = trunc i32 %add to i8
  ret i8 %tr
}

With no data-layout, or with an x86 target where the 8-bit integer width is listed in the data-layout, we reduce to:

$ ./opt -instcombine narrowadd.ll -S
define i8 @narrow_add(i8 %x, i8 %y) {
  %add = add i8 %x, %y
  ret i8 %add
}
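
As a sanity check on why the narrow form is equivalent, the two versions can be
modelled in C++ and compared over all 65536 input pairs. This is only a sketch:
addWide, addNarrow and narrowAddMatchesWide are made-up names, and the narrow
add is written as an explicit 8-bit ripple-carry adder because C++ integer
promotion would otherwise widen the operands behind our backs.

```cpp
#include <cassert>
#include <cstdint>

// Wide form: zext both operands to 32 bits, add, then trunc to 8 bits.
std::uint8_t addWide(std::uint8_t X, std::uint8_t Y) {
  std::uint32_t X32 = X;         // zext i8 %x to i32
  std::uint32_t Y32 = Y;         // zext i8 %y to i32
  std::uint32_t Add = X32 + Y32; // add nsw i32
  // Two zero-extended i8 values sum to at most 510, so the 32-bit add can
  // never overflow and the nsw flag is trivially satisfied.
  assert(Add <= 510u);
  return static_cast<std::uint8_t>(Add); // trunc i32 %add to i8
}

// Narrow form: an 8-bit add built bit by bit. The carry only ever moves
// towards higher bits, and the carry out of bit 7 is dropped.
std::uint8_t addNarrow(std::uint8_t X, std::uint8_t Y) {
  unsigned Carry = 0;
  unsigned Result = 0;
  for (unsigned Bit = 0; Bit < 8; ++Bit) {
    unsigned A = (X >> Bit) & 1u;
    unsigned B = (Y >> Bit) & 1u;
    Result |= (A ^ B ^ Carry) << Bit;
    Carry = (A & B) | (Carry & (A ^ B));
  }
  return static_cast<std::uint8_t>(Result);
}

// Exhaustively compares both forms over every pair of 8-bit inputs.
bool narrowAddMatchesWide() {
  for (unsigned X = 0; X < 256; ++X)
    for (unsigned Y = 0; Y < 256; ++Y)
      if (addWide(static_cast<std::uint8_t>(X), static_cast<std::uint8_t>(Y)) !=
          addNarrow(static_cast<std::uint8_t>(X), static_cast<std::uint8_t>(Y)))
        return false;
  return true;
}
```

The point of the bit-by-bit version is that bits 8 and up of the wide sum never
feed back into the low 8 bits, so the trunc can be hoisted above the add.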

But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType().

Should we make an exception to allow narrowing for the common cases of i8 and i16?

In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions:

declare void @use4(i8, i8, i8, i8)
define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
  %nx = xor i8 %x, -1
  %ny = xor i8 %y, -1
  %nz = xor i8 %z, -1
  %zx = zext i8 %nx to i32
  %zy = zext i8 %ny to i32
  %zz = zext i8 %nz to i32

  %cmpxz = icmp ult i32 %zx, %zz
  %minxz = select i1 %cmpxz, i32 %zx, i32 %zz
  %cmpyz = icmp ult i32 %zy, %zz
  %minyz = select i1 %cmpyz, i32 %zy, i32 %zz
  %cmpyx = icmp ult i8 %y, %x
  %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz
  %tr_minxyz = trunc i32 %minxyz to i8

  %new_zx = sub nsw i32 %zx, %minxyz
  %new_zy = sub nsw i32 %zy, %minxyz
  %new_zz = sub nsw i32 %zz, %minxyz
  %new_x = trunc i32 %new_zx to i8
  %new_y = trunc i32 %new_zy to i8
  %new_z = trunc i32 %new_zz to i8

  call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
  ret void
}

...but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse:

define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
  %nx = xor i8 %x, -1
  %ny = xor i8 %y, -1
  %nz = xor i8 %z, -1
  %cmpxz = icmp ult i8 %nx, %nz
  %minxz = select i1 %cmpxz, i8 %nx, i8 %nz
  %1 = icmp ult i8 %minxz, %ny
  %minxyz = select i1 %1, i8 %minxz, i8 %ny
  %new_x = sub i8 %nx, %minxyz
  %new_y = sub i8 %ny, %minxyz
  %new_z = sub i8 %nz, %minxyz

  call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
  ret void
}
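
For anyone who wants to convince themselves that the 19-instruction and
10-instruction forms compute the same values, here is a brute-force C++ model
of both, compared over all 2^24 inputs. It is a sketch only: Args, minOf3Wide,
minOf3Narrow and minOf3FormsAgree are made-up names, and the returned array
stands in for the four arguments passed to @use4.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using U8 = std::uint8_t;
using Args = std::array<U8, 4>; // the four values handed to @use4

// Mirrors the form that stays in 32 bits.
Args minOf3Wide(U8 X, U8 Y, U8 Z) {
  U8 NX = static_cast<U8>(~X); // xor i8 %x, -1
  U8 NY = static_cast<U8>(~Y);
  U8 NZ = static_cast<U8>(~Z);
  std::uint32_t ZX = NX, ZY = NY, ZZ = NZ; // zext to i32
  std::uint32_t MinXZ = ZX < ZZ ? ZX : ZZ;
  std::uint32_t MinYZ = ZY < ZZ ? ZY : ZZ;
  // icmp ult i8 %y, %x: y < x exactly when ~x < ~y for 8-bit values.
  std::uint32_t MinXYZ = (Y < X) ? MinXZ : MinYZ;
  // The minimum never exceeds any operand, so the subs cannot wrap (nsw).
  assert(MinXYZ <= ZX && MinXYZ <= ZY && MinXYZ <= ZZ);
  return {static_cast<U8>(MinXYZ), static_cast<U8>(ZX - MinXYZ),
          static_cast<U8>(ZY - MinXYZ), static_cast<U8>(ZZ - MinXYZ)};
}

// Mirrors the form where everything is done in 8 bits.
Args minOf3Narrow(U8 X, U8 Y, U8 Z) {
  U8 NX = static_cast<U8>(~X);
  U8 NY = static_cast<U8>(~Y);
  U8 NZ = static_cast<U8>(~Z);
  U8 MinXZ = NX < NZ ? NX : NZ;
  U8 MinXYZ = MinXZ < NY ? MinXZ : NY;
  return {MinXYZ, static_cast<U8>(NX - MinXYZ), static_cast<U8>(NY - MinXYZ),
          static_cast<U8>(NZ - MinXYZ)};
}

// Exhaustively compares both forms over every (x, y, z) triple.
bool minOf3FormsAgree() {
  for (unsigned X = 0; X < 256; ++X)
    for (unsigned Y = 0; Y < 256; ++Y)
      for (unsigned Z = 0; Z < 256; ++Z) {
        U8 A = static_cast<U8>(X), B = static_cast<U8>(Y),
           C = static_cast<U8>(Z);
        if (minOf3Wide(A, B, C) != minOf3Narrow(A, B, C))
          return false;
      }
  return true;
}
```

This says nothing about which form is cheaper on a given target; it only
checks that the two functions are interchangeable at the value level.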