<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/117557>117557</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            Over-eager SIMD causing up to 6x regression

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          BlobTheKat

      </td>

    </tr>

</table>

<pre>

# Part 1

```cpp

#include <math.h>

#include <bit>

using namespace std;

volatile float accumulate = 1.;

int main(){

        for(int i=0;i<1000000000;i++){

                float r = accumulate;

                r = abs(r);

                // or indeed

                //r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7fffffff);

                accumulate = r;

        }

}

```

The loop is partially unrolled, but each load/abs()/store is turned into the following (the global `accumulate` appears as `x` in the listings):

```x86asm

.LCPI0_0:

        .long   0x7fffffff

        .long   0x7fffffff

        .long   0x7fffffff

        .long   0x7fffffff

main:

        ...

        movss   xmm1, dword ptr [rip + x]

        andps   xmm1, xmm0

        movss   dword ptr [rip + x], xmm1

        ...

```

The compiler attempts to use SIMD here, but the result is much slower. On my machine the loop completes in 2.72s on average.

In contrast, replacing `0x7fffffff` with `0x7ffffffe` results in much nicer assembly:

```x86asm

; LCPI0_0 is gone

main:

        ...

        and dword ptr [rip + x], 2147483646

        ...

```

On my machine the loop completes in 450ms on average, about 6x faster.

# Part 2

Let's say we perform the volatile load/store outside the loop:

```cpp

#include <math.h>

#include <bit>

using namespace std;

volatile float accumulate = 1.;

int main(){

        float r = accumulate;

        for(int i=0;i<1000000000;i++){

                r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7fffffff);

        }

        accumulate = r;

}

```

The loop is unrolled 8x and compiled into

```x86asm

.LCPI0_0:

        .long   0x7fffffff

        .long   0x7fffffff

        .long   0x7fffffff

        .long   0x7fffffff

main:

        ; xmm1 = xmm0 = accumulate

        ...

        andps   xmm0, xmm1

        ...

        ; accumulate = xmm0

```

Completion time: 270ms

Now with the modified constant, as in Part 1:

```cpp

#include <math.h>

#include <bit>

using namespace std;

volatile float accumulate = 1.;

int main(){

        float r = accumulate;

        for(int i=0;i<1000000000;i++){

                r = bit_cast<float>(bit_cast<unsigned int>(r) & 0x7ffffffe);

        }

        accumulate = r;

}

```

Here the loop is optimized away completely, leaving:

```x86asm

main:

        and     dword ptr [rip + x], 2147483646

        xor     eax, eax

        ret

x:

        .long   0x3f800000

```

In the first case the compiler uses SIMD instructions where they are not necessary, creating an additional constant in a data section in the process.

In the second case the compiler does the same and also fails to eliminate a loop whose body depends on a value loaded from a volatile, even though the volatile itself is not accessed inside the loop.

**In both cases the compiler decides that SIMD is faster if and only if the second operand to `&` is `0x7fffffff`, which is wrong: only one value is being modified, so no SIMD is needed.**

</pre>
