[llvm-bugs] [Bug 34945] New: C++ NEON intrinsics code using arrays of NEON variables is compiled to inefficient code

via llvm-bugs llvm-bugs at lists.llvm.org
Fri Oct 13 19:27:07 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=34945

            Bug ID: 34945
           Summary: C++ NEON intrinsics code using arrays of NEON
                    variables is compiled to inefficient code
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: AArch64
          Assignee: unassignedbugs at nondot.org
          Reporter: jacob.benoit.1 at gmail.com
                CC: echristo at gmail.com, jan.wassenberg at gmail.com,
                    llvm-bugs at lists.llvm.org

Created attachment 19274
  --> https://bugs.llvm.org/attachment.cgi?id=19274&action=edit
Testcase

At least with the LLVM 5.0 toolchain in Android NDK r15c (and with every other
recent NDK LLVM I've tried), when compiling for AArch64, C++ NEON intrinsics
code that uses arrays of NEON variables, like

```
#include <arm_neon.h>
int32x4_t foo[4];
// This for loop is unrolled by the compiler.
// Manually unrolling it does not make a difference.
for (int i = 0; i < 4; i++) do_something(foo[i]);
```

is compiled to slow code; rewriting it to declare separate variables instead of
an array makes it much faster (roughly 1.6x on Cortex-A73 and 2x on Cortex-A53
in the results below), e.g.

```
#include <arm_neon.h>
int32x4_t foo0, foo1, foo2, foo3;
// Now we have no choice but to manually unroll this code,
// as we don't have our 4 variables nicely tucked into an array.
do_something(foo0);
do_something(foo1);
do_something(foo2);
do_something(foo3);
```

I learned that trick from Jan Wassenberg (CC'd). It seems very surprising that
this would make any difference at all.

Attaching a self-contained testcase. It's not a minimal testcase, but it allows
quantifying the impact of this bug on concrete production code
(https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc),
and it should be trivial to extract a minimal testcase resembling the above
snippets from it, or to write one from scratch.
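
A from-scratch minimal testcase along those lines might look like the sketch
below. It is not extracted from the attachment: the function names
(accumulate_array, accumulate_separate), the v_* wrappers, and the memory layout
are hypothetical, and the non-ARM fallback exists only so the file builds on
other hosts. Whether this exact sketch reproduces the slowdown has not been
verified; it only shows the two code shapes side by side.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
typedef int32x4_t vec_t;
static inline vec_t v_zero() { return vdupq_n_s32(0); }
static inline vec_t v_load(const std::int32_t* p) { return vld1q_s32(p); }
static inline vec_t v_add(vec_t a, vec_t b) { return vaddq_s32(a, b); }
static inline void v_store(std::int32_t* p, vec_t v) { vst1q_s32(p, v); }
#else
// Assumption: plain C++ stand-in so the sketch builds on non-ARM hosts.
// It is not NEON and says nothing about the AArch64 backend.
struct vec_t { std::int32_t lane[4]; };
static inline vec_t v_zero() { return vec_t{{0, 0, 0, 0}}; }
static inline vec_t v_load(const std::int32_t* p) {
  return vec_t{{p[0], p[1], p[2], p[3]}};
}
static inline vec_t v_add(vec_t a, vec_t b) {
  for (int i = 0; i < 4; i++) a.lane[i] += b.lane[i];
  return a;
}
static inline void v_store(std::int32_t* p, vec_t v) {
  for (int i = 0; i < 4; i++) p[i] = v.lane[i];
}
#endif

// Variant A: the four accumulators live in an array of vectors.
// src holds 16 * depth values; dst receives 16 values.
void accumulate_array(const std::int32_t* src, int depth, std::int32_t* dst) {
  assert(depth >= 0);
  vec_t acc[4];
  for (int i = 0; i < 4; i++) acc[i] = v_zero();
  for (int d = 0; d < depth; d++) {
    for (int i = 0; i < 4; i++) {
      acc[i] = v_add(acc[i], v_load(src + 16 * d + 4 * i));
    }
  }
  for (int i = 0; i < 4; i++) v_store(dst + 4 * i, acc[i]);
}

// Variant B: the same computation with four separate variables,
// manually unrolled.
void accumulate_separate(const std::int32_t* src, int depth, std::int32_t* dst) {
  assert(depth >= 0);
  vec_t acc0 = v_zero(), acc1 = v_zero(), acc2 = v_zero(), acc3 = v_zero();
  for (int d = 0; d < depth; d++) {
    acc0 = v_add(acc0, v_load(src + 16 * d + 0));
    acc1 = v_add(acc1, v_load(src + 16 * d + 4));
    acc2 = v_add(acc2, v_load(src + 16 * d + 8));
    acc3 = v_add(acc3, v_load(src + 16 * d + 12));
  }
  v_store(dst + 0, acc0);
  v_store(dst + 4, acc1);
  v_store(dst + 8, acc2);
  v_store(dst + 12, acc3);
}
```

Since the sketch has no main(), one way to use it would be to compile it with
-O3 -S for AArch64 and compare the assembly generated for the two functions.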

Example compilation command line:

aarch64-linux-android-clang++ -fPIE -static --std=c++11 -O3 simd-testcase.cc -o /tmp/x

Example outputs:

Pixel 2 big cores, ARM Cortex-A73:

```
gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables      14 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables        21.8 Gop/s
gemm_kernel_inline_asm                                           26.8 Gop/s
```

Pixel 2 little cores, ARM Cortex-A53:

```
gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables      5.27 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables        10.3 Gop/s
gemm_kernel_inline_asm                                           11.6 Gop/s
```
