[llvm-bugs] [Bug 46888] New: AArch64 unnecessary widening lowers vector performance

Wed Jul 29 03:31:35 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=46888

            Bug ID: 46888
           Summary: AArch64 unnecessary widening lowers vector
                    performance
           Product: tools
           Version: trunk
          Hardware: Other
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: opt
          Assignee: unassignedbugs at nondot.org
          Reporter: joel.hutton at arm.com
                CC: llvm-bugs at lists.llvm.org

For the following test snippet:

#include <stdlib.h>  /* abs */

int arrSum(unsigned char *a1, int inc_a1, unsigned char *a2,
                                int inc_a2) {
  int sum = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++) {
      sum += abs(a1[x] - a2[x]);
    }
    a1 += inc_a1;
    a2 += inc_a2;
  }
  return sum;
}

Using clang -O3:

LLVM trunk first widens the bytes to 16-bit halfwords with ushll/ushll2
instructions and then takes the absolute differences of the halfwords with
uabdl/uabdl2, widening the results to 32-bit lanes.

    sxtw    x8, w3
    ldr     q4, [x0]
    ldr     q5, [x2]
    add     x10, x0, x9
    add     x11, x2, x8
    ldr     q0, [x10]
    ldr     q1, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ldr     q2, [x10]
    ldr     q3, [x11]
    add     x10, x10, x9
    add     x11, x11, x8
    ushll   v7.8h, v4.8b, #0
    ushll2  v16.8h, v4.16b, #0
    ushll   v17.8h, v5.8b, #0
    ushll2  v5.8h, v5.16b, #0
    ldr     q4, [x10]
    ldr     q6, [x11]
    uabdl   v18.4s, v16.4h, v5.4h
    uabdl   v19.4s, v7.4h, v17.4h
    uabdl2  v16.4s, v16.8h, v5.8h
    uabdl2  v17.4s, v7.8h, v17.8h

This is wider than necessary: the absolute-difference operation can be
performed on the bytes directly. Operating on the bytes processes 8 lanes per
instruction instead of 4 and avoids the unnecessary shifts. This can be seen
in the GCC codegen.

For gcc (trunk) -O3:

    sxtw    x2, w3
    sxtw    x3, w1
    add     x11, x5, x2
    ldr     q1, [x4]
    add     x9, x11, x2
    ldr     q5, [x5]
    add     x7, x9, x2
    ldr     q3, [x4, w1, sxtw]
    add     x4, x4, x3
    uabdl2  v2.8h, v1.16b, v5.16b
    add     x10, x4, x3
    ldr     q4, [x5, w2, sxtw]
    add     x8, x10, x3
    movi    v0.4s, 0
    add     x6, x8, x3
    uabal   v2.8h, v1.8b, v5.8b
    add     x5, x7, x2
    uabdl2  v1.8h, v3.16b, v4.16b
    add     x19, x5, x2
    ldr     q6, [x11, w2, sxtw]
    mov     x0, x2
    ldr     q5, [x4, w1, sxtw]
    add     x4, x6, x3
    uabal   v1.8h, v3.8b, v4.8b
    add     x30, x4, x3
    uadalp  v0.4s, v2.8h
    add     x18, x19, x2
    uabdl2  v2.8h, v5.16b, v6.16b
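
As a rough illustration of why the byte-wise form is safe, the sketch below
models the strategy visible in the GCC output above in portable scalar C. It is
not the code either compiler generates. The function names sad16x16_ref and
sad16x16_narrow are hypothetical, and the mapping of each step to uabdl2, uabal
and uadalp in the comments is an informal reading of the assembly. The point is
that the absolute difference of two unsigned bytes is at most 255, so it fits
in a byte, and the sum of two such differences (at most 510) fits in a 16-bit
lane.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: same arithmetic as the test snippet (bytes promoted to int). */
static int sad16x16_ref(const unsigned char *a1, int inc_a1,
                        const unsigned char *a2, int inc_a2) {
  int sum = 0;
  for (int y = 0; y < 16; y++) {
    for (int x = 0; x < 16; x++)
      sum += abs(a1[x] - a2[x]);
    a1 += inc_a1;
    a2 += inc_a2;
  }
  return sum;
}

/* Absolute difference computed at byte width; the result is in 0..255. */
static uint8_t absdiff_u8(uint8_t a, uint8_t b) {
  return (uint8_t)(a > b ? a - b : b - a);
}

/* Narrow variant (hypothetical model of the byte-wise strategy). */
static int sad16x16_narrow(const unsigned char *a1, int inc_a1,
                           const unsigned char *a2, int inc_a2) {
  assert(a1 != NULL && a2 != NULL);
  uint32_t total = 0;
  for (int y = 0; y < 16; y++) {
    uint16_t lanes[8];
    for (int x = 0; x < 8; x++) {
      /* Upper 8 bytes: difference widened to 16 bits (compare uabdl2). */
      lanes[x] = absdiff_u8(a1[x + 8], a2[x + 8]);
      /* Lower 8 bytes: accumulate into the same lane (compare uabal).
         Each lane holds at most 255 + 255 = 510, well within 16 bits. */
      lanes[x] = (uint16_t)(lanes[x] + absdiff_u8(a1[x], a2[x]));
    }
    /* Widen to 32 bits only when accumulating (compare uadalp). */
    for (int x = 0; x < 8; x++)
      total += lanes[x];
    a1 += inc_a1;
    a2 += inc_a2;
  }
  return (int)total;
}
```

Both functions should return the same value for any 16x16 block, which
suggests that widening to 16 bits before the subtraction is not needed for
correctness.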

This affects SPEC2017 x264 performance.
