<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - addus/subus-as-native IR can be defeated by optimizer"

   href="https://bugs.llvm.org/show_bug.cgi?id=38691">38691</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>addus/subus-as-native IR can be defeated by optimizer

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>srj@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>The revision from patch <a href="https://reviews.llvm.org/D46179#1211902">https://reviews.llvm.org/D46179#1211902</a> (Lowering

addus/subus intrinsics to native IR) requires that IR be emitted in certain

patterns in order to produce paddus/psubus instructions. However, it's easy to

emit IR that the LLVM optimizer rearranges so that these instructions are no

longer produced and a much slower instruction sequence is generated instead.

For example, if user code assembles a vector from smaller pieces (e.g., on

SSE2, by loading two 8-byte halves rather than a single 16-byte whole), the

code emitted before this patch might have looked something like:

```

  ; Do a saturating unsigned add on two <8 x i8> vectors (widened to

  ; <16 x i8>), then take the low 8 lanes of the result

  %20 = load <8 x i8>

  %21 = load <8 x i8>

  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %24 = call <16 x i8> @llvm.x86.sse2.paddus.b(<16 x i8> %22, <16 x i8> %23) #5

  %25 = shufflevector <16 x i8> %24, <16 x i8> undef, <8 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

```

To work with this patch, I revised my project's code to emit inline code that

should pattern-match properly (based on the new self-tests for the IR),

something like:

```

  %20 = load <8 x i8>

  %21 = load <8 x i8>

  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  ; Here's the inline pattern that should match paddusb

  %24 = add <16 x i8> %22, %23

  %25 = icmp ugt <16 x i8> %22, %24

  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8

-1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16

x i8> %24

  ;

  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

```
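
For reference, the add/icmp/select sequence above is the usual wraparound check
for an unsigned saturating add. A minimal Python sketch of a single 8-bit lane
(the helper names are illustrative only, not LLVM APIs) checks it exhaustively
against a plain clamp-to-255 definition:

```python
# Scalar model of one i8 lane. All names here are hypothetical helpers,
# not part of LLVM.

def uadd_sat_pattern(a: int, b: int) -> int:
    """Mirror the add / icmp ugt / select sequence on an unsigned 8-bit lane."""
    wrapped = (a + b) & 0xFF      # add: wraps modulo 256
    overflowed = a > wrapped      # icmp ugt: true only when the add wrapped
    return 0xFF if overflowed else wrapped  # select: i8 -1 is 0xFF unsigned


def uadd_sat_reference(a: int, b: int) -> int:
    """Unsigned saturating add defined directly as a clamp."""
    return min(a + b, 0xFF)


def pattern_matches_reference() -> bool:
    """Compare both definitions over every pair of 8-bit inputs."""
    return all(
        uadd_sat_pattern(a, b) == uadd_sat_reference(a, b)
        for a in range(256)
        for b in range(256)
    )
```

Calling pattern_matches_reference() compares all 65536 input pairs.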

And, in fact, if I don't run any optimizer passes, the IR above works perfectly.

Unfortunately, the LLVM optimizer passes can rearrange this, e.g. into

something like:

```

  %20 = load <8 x i8>

  %21 = load <8 x i8>

  %22 = add <8 x i8> %20, %21

  %23 = shufflevector <8 x i8> %22, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %24 = icmp ult <8 x i8> %22, %20

  %25 = shufflevector <8 x i1> %24, <8 x i1> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8

-1, i8 -1, i8 -1, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef,

i8 undef, i8 undef>, <16 x i8> %23

  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

```

...which no longer gets recognized as a pattern that produces paddusb, since

the select no longer refers directly to the result of the compare (but rather

to an intermediate shuffle).
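
To illustrate the failure mode, here is a toy Python sketch of a recognizer
over a made-up expression tree. It is not how LLVM's matcher is written: the
Node type, the op names, and the look_through option are all assumptions for
illustration, and shuffle masks, undef lanes, and types are ignored. A strict
match rejects the rearranged form, while one that skips intermediate shuffles
(and accepts the swapped ult compare) recovers the operands:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Toy expression node (hypothetical; not an LLVM data structure)."""
    op: str
    operands: Tuple["Node", ...] = ()
    name: str = ""


def strip_shuffle(node: Node) -> Node:
    """Skip one shuffle; the toy ignores masks and undef lanes."""
    return node.operands[0] if node.op == "shuffle" else node


def match_uadd_sat(sel: Node, look_through: bool = False) -> Optional[Tuple[Node, Node]]:
    """Return the add operands if sel is select(a ugt a+b, all-ones, a+b)."""
    if sel.op != "select":
        return None
    cond, true_val, false_val = sel.operands
    if look_through:
        cond, false_val = strip_shuffle(cond), strip_shuffle(false_val)
    if true_val.op != "allones" or false_val.op != "add":
        return None
    # Accept both spellings of the compare: "a ugt sum" and "sum ult a".
    if cond.op == "icmp_ugt":
        lhs, total = cond.operands
    elif cond.op == "icmp_ult":
        total, lhs = cond.operands
    else:
        return None
    if total != false_val or lhs not in false_val.operands:
        return None
    return false_val.operands[0], false_val.operands[1]


a, b = Node("arg", name="a"), Node("arg", name="b")
ones = Node("allones")

# Form as emitted: widen first, then the select uses the compare directly.
wide_a, wide_b = Node("shuffle", (a,)), Node("shuffle", (b,))
wide_sum = Node("add", (wide_a, wide_b))
direct = Node("select", (Node("icmp_ugt", (wide_a, wide_sum)), ones, wide_sum))

# Form after optimization: narrow add and compare, shuffles in between.
narrow_sum = Node("add", (a, b))
rearranged = Node("select", (
    Node("shuffle", (Node("icmp_ult", (narrow_sum, a)),)),
    ones,
    Node("shuffle", (narrow_sum,)),
))
```

Here match_uadd_sat(direct) succeeds, match_uadd_sat(rearranged) returns None,
and match_uadd_sat(rearranged, look_through=True) succeeds again.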

Either the recognizer needs to be smarter about this, or there needs to be an

explicit way to emit code that is guaranteed to produce the expected

instruction(s).</pre>

        </div>

      </p>


    </body>

</html>