<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - C++ NEON intrinsics code using arrays of NEON variables is compiled to inefficient code"
href="https://bugs.llvm.org/show_bug.cgi?id=34945">34945</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>C++ NEON intrinsics code using arrays of NEON variables is compiled to inefficient code
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: AArch64
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>jacob.benoit.1@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>echristo@gmail.com, jan.wassenberg@gmail.com, llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=19274" name="attach_19274" title="Testcase">attachment 19274</a> <a href="attachment.cgi?id=19274&action=edit" title="Testcase">[details]</a></span>
Testcase
At least with the LLVM 5.0 toolchain in Android NDK r15c (in fact with each
recent NDK LLVM I've tried), when compiling to AArch64, C++ NEON intrinsics
code that uses arrays of NEON variables, like
```
#include <arm_neon.h>
int32x4_t foo[4];
// This for loop is unrolled by the compiler.
// Manually unrolling it does not make a difference.
for (int i = 0; i < 4; i++) do_something(foo[i]);
```
is slow. Rewriting this code to declare separate variables instead of an array
makes it much faster (about 1.6x on Cortex-A73 and about 2x on Cortex-A53 in
the example outputs further down), e.g.
```
#include <arm_neon.h>
int32x4_t foo0, foo1, foo2, foo3;
// Now we have no choice but to manually unroll this code,
// as we don't have our 4 variables nicely tucked into an array.
do_something(foo0);
do_something(foo1);
do_something(foo2);
do_something(foo3);
```
I learned that trick from Jan Wassenberg (CC'd). It seems very surprising that
this would make any difference at all.
I'm attaching a self-contained testcase. It's not minimal, but it makes it
possible to quantify the impact of this bug on concrete production code
(<a href="https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc">https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc</a>),
and it should be trivial to extract a minimal testcase looking like the above
snippets from it, or to write one from scratch.
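A minimal testcase along those lines might look like the sketch below. It is
not extracted from the attachment: the kernel names, the data layout and the
helper variants_agree() are hypothetical, and the #else branch is only an
assumed stand-in (GCC/Clang vector extensions) so that the file also builds on
a non-ARM host. Whether this reduced kernel shows the same slowdown has not
been checked.
```cpp
#include <cstdint>
#include <vector>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
// Assumption: off-target stand-ins built on GCC/Clang vector extensions,
// present only so that the sketch also compiles on a non-ARM host.
typedef std::int32_t int32x4_t __attribute__((vector_size(16)));
static inline int32x4_t vdupq_n_s32(std::int32_t x) {
  int32x4_t v = {x, x, x, x};
  return v;
}
static inline int32x4_t vld1q_s32(const std::int32_t* p) {
  int32x4_t v = {p[0], p[1], p[2], p[3]};
  return v;
}
static inline void vst1q_s32(std::int32_t* p, int32x4_t v) {
  for (int i = 0; i < 4; i++) p[i] = v[i];
}
static inline int32x4_t vmlaq_s32(int32x4_t acc, int32x4_t a, int32x4_t b) {
  return acc + a * b;
}
#endif

// Hypothetical kernel:
//   out[4*i + j] = sum over d of lhs[4*d + j] * rhs[4*d + i]
// Variant 1 keeps the accumulators in an array (the form reported as slow).
void kernel_array(const std::int32_t* lhs, const std::int32_t* rhs, int depth,
                  std::int32_t* out) {
  int32x4_t acc[4];
  for (int i = 0; i < 4; i++) acc[i] = vdupq_n_s32(0);
  for (int d = 0; d < depth; d++) {
    const int32x4_t l = vld1q_s32(lhs + 4 * d);
    for (int i = 0; i < 4; i++)
      acc[i] = vmlaq_s32(acc[i], l, vdupq_n_s32(rhs[4 * d + i]));
  }
  for (int i = 0; i < 4; i++) vst1q_s32(out + 4 * i, acc[i]);
}

// Variant 2 computes the same thing with four separate variables,
// manually unrolled (the form reported as fast).
void kernel_separate(const std::int32_t* lhs, const std::int32_t* rhs,
                     int depth, std::int32_t* out) {
  int32x4_t acc0 = vdupq_n_s32(0), acc1 = vdupq_n_s32(0);
  int32x4_t acc2 = vdupq_n_s32(0), acc3 = vdupq_n_s32(0);
  for (int d = 0; d < depth; d++) {
    const int32x4_t l = vld1q_s32(lhs + 4 * d);
    acc0 = vmlaq_s32(acc0, l, vdupq_n_s32(rhs[4 * d + 0]));
    acc1 = vmlaq_s32(acc1, l, vdupq_n_s32(rhs[4 * d + 1]));
    acc2 = vmlaq_s32(acc2, l, vdupq_n_s32(rhs[4 * d + 2]));
    acc3 = vmlaq_s32(acc3, l, vdupq_n_s32(rhs[4 * d + 3]));
  }
  vst1q_s32(out + 0, acc0);
  vst1q_s32(out + 4, acc1);
  vst1q_s32(out + 8, acc2);
  vst1q_s32(out + 12, acc3);
}

// Sanity check: both variants must produce identical results, so any
// difference between them is in the generated code only.
bool variants_agree(int depth) {
  std::vector<std::int32_t> lhs(4 * depth), rhs(4 * depth);
  for (int k = 0; k < 4 * depth; k++) {
    lhs[k] = k % 7;
    rhs[k] = k % 5 - 2;
  }
  std::int32_t a[16], b[16];
  kernel_array(lhs.data(), rhs.data(), depth, a);
  kernel_separate(lhs.data(), rhs.data(), depth, b);
  for (int i = 0; i < 16; i++)
    if (a[i] != b[i]) return false;
  return true;
}
```
This sketch does no timing. To use it, one would compile it for AArch64 with
-O3, add -S to get assembly, and compare the code generated for kernel_array
and kernel_separate.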
Example compilation command line:
aarch64-linux-android-clang++ -fPIE -static --std=c++11 -O3 simd-testcase.cc -o /tmp/x
Example outputs:
Pixel 2 big cores, ARM Cortex-A73:
```
gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables 14 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables 21.8 Gop/s
gemm_kernel_inline_asm 26.8 Gop/s
```
Pixel 2 little cores, ARM Cortex-A53:
```
gemm_kernel_intrinsics_naive_using_arrays_of_neon_variables 5.27 Gop/s
gemm_kernel_intrinsics_fast_using_separate_neon_variables 10.3 Gop/s
gemm_kernel_inline_asm 11.6 Gop/s
```</pre>
</div>
</p>
</body>
</html>