<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/117557>117557</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Over-eager SIMD causing up to 6x regression
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
BlobTheKat
</td>
</tr>
</table>
<pre>
# Part 1
```cpp
#include <math.h>
#include <bit>
using namespace std;
volatile float accumulate = 1.;
int main(){
    for(int i=0;i<1000000000;i++){
        float r = accumulate;
        r = abs(r);
        // or indeed
        //r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7fffffff);
        accumulate = r;
    }
}
```
The loop is partially unrolled, but each load/abs()/store is turned into:
```x86asm
.LCPI0_0:
.long 0x7fffffff
.long 0x7fffffff
.long 0x7fffffff
.long 0x7fffffff
main:
...
movss xmm1, dword ptr [rip + x]
andps xmm1, xmm0
movss dword ptr [rip + x], xmm1
...
```
The compiler attempts to make use of SIMD, but the result ends up being much slower. On my machine the loop completes in 2.72 s on average.
In contrast, replacing `0x7fffffff` with `0x7ffffffe` results in much nicer assembly:
```x86asm
; LCPI0_0 is gone
main:
...
and dword ptr [rip + x], 2147483646
...
```
On my machine the loop completes in 450 ms on average, roughly 6x faster.
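The report does not say how these timings were taken. One way to reproduce them is a `std::chrono` wrapper around the Part 1 loop. This is a minimal sketch: `time_abs_loop` is a hypothetical helper, not part of the original reproducer, and the iteration count is a parameter so that short runs are possible.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

volatile float accumulate = 1.0f;

// Hypothetical helper: runs the Part 1 loop body `iterations` times
// and returns the elapsed wall-clock time in milliseconds.
double time_abs_loop(long iterations) {
    assert(iterations >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++) {
        float r = accumulate;  // volatile load on every iteration
        r = std::fabs(r);      // the operation that gets lowered to andps
        accumulate = r;        // volatile store on every iteration
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_abs_loop(1000000000)` corresponds to the loop above. Absolute numbers will vary with the CPU and compiler flags, so only the ratio between the two variants is meaningful.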
# Part 2
Let's say we perform the volatile load/store outside the loop:
```cpp
#include <math.h>
#include <bit>
using namespace std;
volatile float accumulate = 1.;
int main(){
    float r = accumulate;
    for(int i=0;i<1000000000;i++){
        r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7fffffff);
    }
    accumulate = r;
}
```
The loop is unrolled 8x and compiled into:
```x86asm
.LCPI0_0:
.long 0x7fffffff
.long 0x7fffffff
.long 0x7fffffff
.long 0x7fffffff
main:
; xmm1 = xmm0 = accumulate
...
andps xmm0, xmm1
...
; accumulate = xmm0
```
Completion time: 270 ms.
Now with the modified constant (`0x7ffffffe`), as earlier:
```cpp
#include <math.h>
#include <bit>
using namespace std;
volatile float accumulate = 1.;
int main(){
    float r = accumulate;
    for(int i=0;i<1000000000;i++){
        r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7ffffffe);
    }
    accumulate = r;
}
```
The loop above is optimized away entirely, leaving:
```x86asm
main:
and dword ptr [rip + x], 2147483646
xor eax, eax
ret
x:
.long 0x3f800000
```
In the first case the compiler uses SIMD instructions where they are not necessary, creating an additional data section in the process.
In the second case the compiler does the same and also fails to optimize away a loop whose body depends on a value loaded from a volatile, even though the volatile itself is not accessed inside the loop.
**In both cases the compiler decides that SIMD is faster if and only if the second operand to `&` is `0x7fffffff`. That is wrong: only one value is being modified, so no SIMD is needed.**
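To illustrate the point that a single value needs only a scalar integer AND, here is a minimal sketch of the same sign-bit clear written without any floating-point operation. `clear_sign_bit` and `check_clear_sign_bit` are hypothetical names, and `std::memcpy` is used as a stand-in for `bit_cast` so the snippet also builds before C++20. How a given compiler lowers this function is not claimed here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: clears the sign bit of one float using a
// single 32-bit integer AND, with no vector constant required.
float clear_sign_bit(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);     // same role as bit_cast<unsigned int>
    bits &= 0x7fffffffu;                         // keep exponent and mantissa only
    float result;
    std::memcpy(&result, &bits, sizeof result);  // same role as bit_cast<float>
    return result;
}

// Small self-check; the values are exactly representable, so == is safe.
void check_clear_sign_bit() {
    assert(clear_sign_bit(-1.5f) == 1.5f);
    assert(clear_sign_bit(2.0f) == 2.0f);
    assert(clear_sign_bit(-0.0f) == 0.0f);
}
```

Semantically this matches what the `and dword ptr [rip + x], ...` form does directly on memory.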
</pre>