<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
<a href="https://github.com/llvm/llvm-project/issues/87317">87317</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Supposed performance regression while generating fused SIMD fmadd, starting from clang 11
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            clang
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          galanhir
      </td>
    </tr>
</table>

<pre>
    Dear developers,

I have noticed a change of behaviour starting from clang 11, which I believe is a performance regression related to SIMD multiply/add fusion.

The (simplified) loop is extracted from the following function: 

```
#include <immintrin.h>

int test_fmadd_pd(float *out, const float *in, const double *c, int len, double *h)
{
    const __m256d va1 = _mm256_load_pd(c + 4);
    const __m256d va2 = _mm256_load_pd(c);
    const __m256d vb1 = _mm256_load_pd(c + 12);
    const __m256d vb2 = _mm256_load_pd(c + 8);
    __m256d vh0 = _mm256_load_pd(h);
    __m256d vh1 = _mm256_load_pd(h + 4);
    __m256d vout = _mm256_cvtps_pd(_mm_loadu_ps(in)); ++in;

    for (int s = 0; s < len; s++)
    {
        __m128d x = _mm256_extractf128_pd(vout, 0); // extract lower part
        x = _mm_cvtss_sd(x, _mm_load_ss(in)); // insert
        __m256d vin = _mm256_insertf128_pd(vout, x, 0); ++in;
        __m256d vh0a2 = _mm256_mul_pd(vh0, va2);
        __m256d vh1a1 = _mm256_mul_pd(vh1, va1);
        __m256d vh0b2 = _mm256_mul_pd(vh0, vb2);
        __m256d vh1b1 = _mm256_mul_pd(vh1, vb1);
        __m256d vh2 = _mm256_add_pd(vin, vh0a2);
        vh2 = _mm256_add_pd(vh2, vh1a1);
        vin = _mm256_add_pd(vh2, vh0b2);
        vin = _mm256_add_pd(vin, vh1b1);
        vh0 = vh1;
        vh1 = vh2;
        vout = _mm256_permute4x64_pd(vin, 147);
        _mm_store_ss(out, _mm_cvtsd_ss(_mm_set_ss(0.0f), _mm256_extractf128_pd(vout, 0))); ++out;
    }

    return 0;
}
```

The block is compiled with the following aggressive optimisation options:
```
clang++ -O3 -march=haswell -ffast-math
```

Compilation uses a stock clang and can be reproduced on godbolt.org.

When built with e.g. clang 12, the inner loop appears as:
```
.LBB2_4: # =>This Inner Loop Header: Depth=1
...
        vfmadd231pd     %ymm5, %ymm10, %ymm13   # ymm13 = (ymm10 * ymm5) + ymm13
        vfmadd231pd     %ymm1, %ymm12, %ymm13   # ymm13 = (ymm12 * ymm1) + ymm13
        vmovapd %ymm6, %ymm2
        vfmadd213pd     %ymm13, %ymm10, %ymm2   # ymm2 = (ymm10 * ymm2) + ymm13
        vfmadd231pd     %ymm12, %ymm7, %ymm2    # ymm2 = (ymm7 * ymm12) + ymm2
        vpermpd $147, %ymm2, %ymm11     # ymm11 = ymm2[3,0,1,2]
...
```
whereas with clang 9, it appears as:
```
.LBB2_4:                                # =>This Inner Loop Header: Depth=1
...
        vmulpd  %ymm9, %ymm11, %ymm2         # ymm2 = ymm11 * ymm9
        vfmadd231pd     %ymm8, %ymm12, %ymm2  # ymm2 = (ymm12 * ymm8) + ymm2
        vmulpd  %ymm10, %ymm11, %ymm5        # ymm5 = ymm10 * ymm11
        vaddpd  %ymm1, %ymm2, %ymm11         # ymm11 = ymm2 + ymm1
        vfmadd231pd     %ymm13, %ymm12, %ymm5 # ymm5 = (ymm12 * ymm13) + ymm5
        vaddpd  %ymm11, %ymm5, %ymm1         # ymm1 = ymm5 + ymm11
        vpermpd $147, %ymm1, %ymm2    # ymm2 = ymm1[3,0,1,2]
...
```

Note that, in the clang 12 version, dependencies between the `vfmadd` instructions prevent any `vfmadd` from starting before the previous one has completed. In the clang 9 version, on the other hand, there is room for parallelism: some of the `mul` instructions can execute in parallel with the `fmadd` instructions, which likely results in better loop throughput.
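A rough way to see the difference is to compare critical-path lengths. The sketch below uses latencies I assume for Haswell (5 cycles for `vmulpd` and `vfmadd*`, 3 for `vaddpd`), so treat the numbers as an estimate, not a measurement:

```cpp
// Back-of-the-envelope critical-path estimate. The latencies are an
// assumption (taken from published Haswell instruction tables):
// vmulpd = 5 cycles, vfmadd* = 5 cycles, vaddpd = 3 cycles.
constexpr int MUL = 5, FMA = 5, ADD = 3;

// clang 12 body: the four vfmadd instructions form one dependency chain,
// so each must wait for the previous result.
constexpr int chain_clang12 = 4 * FMA; // = 20 cycles per iteration

// clang 9 body: the two vmulpd results feed the chain from the side, so
// the critical path is roughly mul -> fmadd -> add -> add.
constexpr int chain_clang9 = MUL + FMA + ADD + ADD; // = 16 cycles per iteration

static_assert(chain_clang12 > chain_clang9,
              "the fully fused form has the longer dependency chain");
```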

In other words, clang 12 seems to systematically fuse the multiplication and the addition, even when it is not beneficial to do so.

Further analysis, using LLVM MCA traces, shows that LLVM itself acknowledges that the clang 12 generated loop is slower than the clang 9 generated loop. Compare for instance the MCA report for the clang 12 loop:

```
$ llvm-mca -mcpu=haswell -iterations=100 build/.../Fmadd_m256_avx2.cpp.s
[0] Code Region

Iterations:        100
Instructions:      3100
Total Cycles:      4820
Total uOps:        3700

Dispatch Width:    4
uOps Per Cycle:    0.77
IPC:               0.64
Block RThroughput: 11.0
```
vs the MCA report from the clang 9 loop:
```
$ llvm-mca -mcpu=haswell -iterations=100 build_clang9/.../Fmadd_m256_avx2.cpp.s
[0] Code Region

Iterations: 100
Instructions:      3300
Total Cycles:      2815
Total uOps:        3900

Dispatch Width:    4
uOps Per Cycle:    1.39
IPC:               1.17
Block RThroughput: 9.8
```

Perhaps coincidentally, the cycle counts reported by MCA closely match performance measurements taken on an Intel CPU compatible with the Haswell architecture:
```
clang 12: 4820 cycles, 0.453 ms
clang 9:  2815 cycles, 0.288 ms
```

I understand that `-ffast-math` implies `-ffp-contract=fast`.

It looks to me (though maybe I am wrong) that this is a regression: I would expect the aggressive FP contraction triggered by `-ffp-contract=fast` to occur _only_ when beneficial, as clang 9 and clang 10 appear to do.
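For context, contraction is not only a scheduling choice, it also changes rounding: a fused multiply-add rounds once, whereas separate mul/add round twice. A minimal sketch (function names are mine) showing the numerical difference that `fp contract` licenses:

```cpp
#include <cmath>

// Separate multiply and add: the product is rounded to double before the
// addition. The volatile intermediate forces that rounding even when the
// compiler would otherwise contract the expression.
double mul_then_add(double a, double b, double c) {
    volatile double p = a * b; // rounded here
    return p + c;
}

// Single rounding at the end, i.e. what a vfmadd instruction computes.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Example: with a = 1 + 2^-27, b = 1 - 2^-27, c = -1, the exact product
// 1 - 2^-54 rounds to 1.0, so mul_then_add cancels to 0.0 while the
// fused form returns -2^-54.
```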

So far, I have not found a way to get clang 12 to behave like clang 9 on this loop.
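For reference, clang does expose a per-block pragma to control contraction; below is a minimal sketch of where it would sit (the function name is mine, and I have not checked whether this restores the clang 9 scheduling for the loop above, since turning contraction off entirely also loses the beneficial fusions):

```cpp
#include <cstddef>

// Hypothetical helper illustrating clang's fp pragma. Under
// `#pragma clang fp contract(off)` the front end keeps the multiply and
// the add as separate operations, even with -ffast-math on the command line.
double dot_nocontract(const double *a, const double *b, std::size_t n) {
#pragma clang fp contract(off)
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i]; // stays mul + add, never vfmadd
    return acc;
}
```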
</pre>