[llvm-bugs] [Bug 38691] New: addus/subus-as-native IR can be defeated by optimizer

Fri Aug 24 10:53:12 PDT 2018

https://bugs.llvm.org/show_bug.cgi?id=38691

            Bug ID: 38691
           Summary: addus/subus-as-native IR can be defeated by optimizer
           Product: new-bugs
           Version: trunk
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: srj at google.com
                CC: llvm-bugs at lists.llvm.org

The revision from patch https://reviews.llvm.org/D46179#1211902 (Lowering
addus/subus intrinsics to native IR) requires that IR be emitted in certain
patterns in order to produce paddus/psubus instructions. However, it's not hard
to emit IR patterns that the LLVM optimizer rearranges so that those
instructions are no longer produced, and a much slower combination of
instructions is generated instead.

For example, if user code assembles a vector from smaller pieces (e.g., on
sse2, by loading two 8-byte halves rather than a single 16-byte whole), code
might have formerly been something like:

```
  ; Do a saturating unsigned add on two <8 x i8> vectors (widened to
  ; <16 x i8> for the intrinsic), then narrow back to an <8 x i8> result
  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %24 = call <16 x i8> @llvm.x86.sse2.paddus.b(<16 x i8> %22, <16 x i8> %23) #5
  %25 = shufflevector <16 x i8> %24, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```

To work with this patch, I revised my project's code to emit inline code that
should pattern-match properly (based on the new self-tests for the IR),
something like:

```
  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  ; Here's the inline pattern that should match paddusb:
  ; wrapping add, overflow check (a > a + b), then saturate to all-ones
  %24 = add <16 x i8> %22, %23
  %25 = icmp ugt <16 x i8> %22, %24
  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8> %24
  ; Narrow back to the <8 x i8> result
  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```
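As a sanity check on the pattern itself, here is a minimal scalar sketch in C of what one lane of the add / icmp ugt / select sequence computes. The function names are hypothetical and chosen for illustration only; this models the arithmetic, not anything in LLVM's pattern matcher.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: unsigned saturating add of two bytes, computed in a wider type. */
uint8_t uadd_sat_reference(uint8_t a, uint8_t b) {
    unsigned wide = (unsigned)a + (unsigned)b;
    return (uint8_t)(wide > 255u ? 255u : wide);
}

/* One lane of the inline IR pattern (hypothetical helper name). */
uint8_t uadd_sat_pattern(uint8_t a, uint8_t b) {
    uint8_t sum = (uint8_t)(a + b);   /* add: wraps modulo 256 */
    int overflowed = a > sum;         /* icmp ugt a, sum */
    return overflowed ? 0xFF : sum;   /* select: i8 -1 is 0xFF */
}

/* Compare the pattern against the reference for every pair of byte values. */
void check_pattern_exhaustively(void) {
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            assert(uadd_sat_pattern((uint8_t)a, (uint8_t)b) ==
                   uadd_sat_reference((uint8_t)a, (uint8_t)b));
        }
    }
}
```

The overflow test works because a wrapped sum is always smaller than either operand, while a sum that did not wrap is at least as large as both.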

And, in fact, if I don't run any optimizer passes, this works perfectly.
Unfortunately, the LLVM optimizer passes can rearrange it, e.g. into a form
something like this:

```
  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = add <8 x i8> %20, %21
  %23 = shufflevector <8 x i8> %22, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %24 = icmp ult <8 x i8> %22, %20
  %25 = shufflevector <8 x i1> %24, <8 x i1> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, <16 x i8> %23
  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```

...which no longer gets recognized as a pattern that produces paddusb, since
the select no longer refers directly to the result of the compare (but rather
to an intermediate shuffle).
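To illustrate that the rearranged form is still semantically the same operation (so the problem is purely one of pattern recognition), here is a rough C model of the two shapes. It is only a sketch: the function names are hypothetical, and the undef upper lanes are modeled as 0, which is an arbitrary stand-in since those lanes are discarded by the final shuffle anyway.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NARROW_LANES = 8, WIDE_LANES = 16 };

/* Hand-emitted shape: widen first, then add / icmp ugt / select on 16 lanes. */
void widen_then_saturate(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    uint8_t wide_a[WIDE_LANES] = {0};  /* upper lanes are undef in the IR */
    uint8_t wide_b[WIDE_LANES] = {0};
    uint8_t result[WIDE_LANES];
    memcpy(wide_a, a, NARROW_LANES);
    memcpy(wide_b, b, NARROW_LANES);
    for (int i = 0; i < WIDE_LANES; ++i) {
        uint8_t sum = (uint8_t)(wide_a[i] + wide_b[i]);
        /* The select is fed directly by the compare. */
        result[i] = (wide_a[i] > sum) ? 0xFF : sum;
    }
    memcpy(out, result, NARROW_LANES);  /* final shuffle keeps lanes 0..7 */
}

/* Optimizer-rearranged shape: add and icmp ult on 8 lanes, widen the sum
 * and the compare mask separately, then select on 16 lanes. */
void saturate_then_widen(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    uint8_t wide_sum[WIDE_LANES] = {0};
    uint8_t wide_mask[WIDE_LANES] = {0};
    uint8_t result[WIDE_LANES];
    for (int i = 0; i < NARROW_LANES; ++i) {
        wide_sum[i] = (uint8_t)(a[i] + b[i]);
        wide_mask[i] = wide_sum[i] < a[i];  /* icmp ult sum, a */
    }
    for (int i = 0; i < WIDE_LANES; ++i) {
        /* The select is now fed by a shuffled copy of the compare. */
        result[i] = wide_mask[i] ? 0xFF : wide_sum[i];
    }
    memcpy(out, result, NARROW_LANES);
}

/* Assert that both shapes produce identical low 8 lanes for one input pair. */
void check_shapes_agree(const uint8_t *a, const uint8_t *b) {
    uint8_t first[NARROW_LANES];
    uint8_t second[NARROW_LANES];
    widen_then_saturate(a, b, first);
    saturate_then_widen(a, b, second);
    assert(memcmp(first, second, NARROW_LANES) == 0);
}
```

Both functions return the same eight bytes for any inputs, so a recognizer that looked through the intermediate shuffles would lose nothing in correctness.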

Either the recognizer needs to be smarter about this, or there needs to be an
explicit way to emit code that is guaranteed to produce the expected
instruction(s).
