<div dir="ltr"><div><div><div>Thanks for the perf testing. I assume that DAG legalization is equipped to handle these cases fairly well, or someone would've complained by now...<br><br></div>FWIW (and at least some of this can be blamed on me), instcombine already does the narrowing transforms without checking shouldChangeType() for binops like and/or/xor/udiv. The justification was that narrower ops are always better for bit-tracking. If we can eliminate casts, then that improves analysis even more and allows more follow-on transforms.<br><br></div>If there are no objections here, I think you can post your patch for review on Phabricator. If we can squash any of the regressions before that goes in, that would be even better. <br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jan 22, 2018 at 3:10 AM, David Green <span dir="ltr"><<a href="mailto:David.Green@arm.com" target="_blank">David.Green@arm.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello<br>

<br>

Thanks for looking into this.<br>

<br>

I can't be very confident what the knock on result of a change like that would be,<br>

especially on architectures that are not Arm. What I can do though, is run some<br>

benchmarks and look at that results.<br>

<br>

Using this patch:<br>

<br>

--- a/lib/Transforms/InstCombine/<wbr>InstructionCombining.cpp<br>

+++ b/lib/Transforms/InstCombine/<wbr>InstructionCombining.cpp<br>

@@ -150,6 +150,9 @@ bool InstCombiner::<wbr>shouldChangeType(unsigned FromWidth,<br>

   bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);<br>

   bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);<br>

<br>

+  if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))<br>

+    return true;<br>

+<br>

   // If this is a legal integer from type, and the result would be an illegal<br>

   // type, don't do the transformation.<br>

   if (FromLegal && !ToLegal)<br>

<br>

<br>

Running on a little A core, in the llvm test suite I am seeing these changes:<br>

<br>

MultiSource/Benchmarks/<wbr>BitBench/uudecode/uudecode<br>

        3.38%<br>

SingleSource/Benchmarks/Adobe-<wbr>C++/simple_types_constant_<wbr>folding<br>

        -35.04%<br>

MultiSource/Benchmarks/<wbr>Trimaran/enc-pc1/enc-pc1<br>

        -17.92%<br>

SingleSource/Benchmarks/Adobe-<wbr>C++/simple_types_loop_<wbr>invariant<br>

        -8.57%<br>

External/SPEC/CINT2000/253.<wbr>perlbmk/253.perlbmk<br>

        -3.43%<br>

MultiSource/Benchmarks/<wbr>MiBench/telecomm-gsm/telecomm-<wbr>gsm<br>

        -3.36%<br>

MultiSource/Benchmarks/TSVC/<wbr>CrossingThresholds-dbl/<wbr>CrossingThresholds-dbl<br>

        -1.34%<br>

<br>

+ve for these is bad, -ve is good. So overall looks like a good change, especially in<br>

simple_types_constant_folding. There may be some alignment issues that can<br>

causing wilder swings than they should, but the results here look good. The list for<br>

aarch64 is roughly the same, just a slightly longer list of minor improvements.<br>

<br>

On our internal cortex-m tests we are seeing more regressions but it's still a net<br>

positive in most cases.<br>

<br>

I would say that at least for these results, it looks like a profitable idea. Like I said<br>

I can't be sure about other architectures though.<br>

<span class="im HOEnZb">Dave<br>

<br>

______________________________<wbr>__________<br>

From: Sanjay Patel <<a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a>><br>

Sent: 17 January 2018 22:50<br>

</span><span class="im HOEnZb">To: llvm-dev<br>

Cc: David Green<br>

Subject: always allow canonicalizing to 8- and 16-bit ops?<br>

<br>

</span><div class="HOEnZb"><div class="h5">Example:<br>

define i8 @narrow_add(i8 %x, i8 %y) {<br>

  %x32 = zext i8 %x to i32<br>

  %y32 = zext i8 %y to i32<br>

  %add = add nsw i32 %x32, %y32<br>

  %tr = trunc i32 %add to i8<br>

  ret i8 %tr<br>

}<br>

<br>

With no data-layout or with an x86 target where 8-bit integer is in the data-layout, we reduce to:<br>

<br>

$ ./opt -instcombine narrowadd.ll -S<br>

define i8 @narrow_add(i8 %x, i8 %y) {<br>

  %add = add i8 %x, %y<br>

  ret i8 %add<br>

}<br>

<br>

But on a target that has 32-bit registers without explicit subregister ops, we don't do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::<wbr>shouldChangeType().<br>

<br>

Should we make an exception to allow narrowing for the common cases of i8 and i16?<br>

<br>

In the motivating example from PR35875 ( <a href="https://bugs.llvm.org/show_bug.cgi?id=35875" rel="noreferrer" target="_blank">https://bugs.llvm.org/show_<wbr>bug.cgi?id=35875</a> ), an ARM target is stuck at 19 IR instructions:<br>

<br>

declare void @use4(i8, i8, i8, i8)<br>

define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {<br>

  %nx = xor i8 %x, -1<br>

  %ny = xor i8 %y, -1<br>

  %nz = xor i8 %z, -1<br>

  %zx = zext i8 %nx to i32<br>

  %zy = zext i8 %ny to i32<br>

  %zz = zext i8 %nz to i32<br>

<br>

  %cmpxz = icmp ult i32 %zx, %zz<br>

  %minxz = select i1 %cmpxz, i32 %zx, i32 %zz<br>

  %cmpyz = icmp ult i32 %zy, %zz<br>

  %minyz = select i1 %cmpyz, i32 %zy, i32 %zz<br>

  %cmpyx = icmp ult i8 %y, %x<br>

  %minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz<br>

  %tr_minxyz = trunc i32 %minxyz to i8<br>

<br>

  %new_zx = sub nsw i32 %zx, %minxyz<br>

  %new_zy = sub nsw i32 %zy, %minxyz<br>

  %new_zz = sub nsw i32 %zz, %minxyz<br>

  %new_x = trunc i32 %new_zx to i8<br>

  %new_y = trunc i32 %new_zy to i8<br>

  %new_z = trunc i32 %new_zz to i8<br>

<br>

  call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)<br>

  ret void<br>

}<br>

<br>

...but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse:<br>

<br>

define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {<br>

  %nx = xor i8 %x, -1<br>

  %ny = xor i8 %y, -1<br>

  %nz = xor i8 %z, -1<br>

  %cmpxz = icmp ult i8 %nx, %nz<br>

  %minxz = select i1 %cmpxz, i8 %nx, i8 %nz<br>

  %1 = icmp ult i8 %minxz, %ny<br>

  %minxyz = select i1 %1, i8 %minxz, i8 %ny<br>

  %new_x = sub i8 %nx, %minxyz<br>

  %new_y = sub i8 %ny, %minxyz<br>

  %new_z = sub i8 %nz, %minxyz<br>

<br>

  call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)<br>

  ret void<br>

}<br>

<br>

</div></div></blockquote></div><br></div>