<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
<a href="https://github.com/llvm/llvm-project/issues/87317">87317</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Supposed performance regression while generating fused SIMD fmadd, starting from clang 11
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            clang
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          galanhir
      </td>
    </tr>
</table>

<pre>
    Dear developers,

I have noticed a change of behaviour starting from clang 11, which I believe is a performance regression related to SIMD multiply/add fusion.

The (simplified) loop is extracted from the following function: 

```
#include <immintrin.h>

int test_fmadd_pd(float *out, const float *in, const double *c, int len, double *h)
{
    const __m256d va1 = _mm256_load_pd(c + 4);
    const __m256d va2 = _mm256_load_pd(c);
    const __m256d vb1 = _mm256_load_pd(c + 12);
    const __m256d vb2 = _mm256_load_pd(c + 8);
    __m256d vh0 = _mm256_load_pd(h);
    __m256d vh1 = _mm256_load_pd(h + 4);
    __m256d vout = _mm256_cvtps_pd(_mm_loadu_ps(in)); ++in;

    for (int s = 0; s < len; s++)
    {
        __m128d x = _mm256_extractf128_pd(vout, 0); // extract lower part
        x = _mm_cvtss_sd(x, _mm_load_ss(in)); // insert
        __m256d vin = _mm256_insertf128_pd(vout, x, 0); ++in;
        __m256d vh0a2 = _mm256_mul_pd(vh0, va2);
        __m256d vh1a1 = _mm256_mul_pd(vh1, va1);
        __m256d vh0b2 = _mm256_mul_pd(vh0, vb2);
        __m256d vh1b1 = _mm256_mul_pd(vh1, vb1);
        __m256d vh2 = _mm256_add_pd(vin, vh0a2);
        vh2 = _mm256_add_pd(vh2, vh1a1);
        vin = _mm256_add_pd(vh2, vh0b2);
        vin = _mm256_add_pd(vin, vh1b1);
        vh0 = vh1;
        vh1 = vh2;
        vout = _mm256_permute4x64_pd(vin, 147);
        _mm_store_ss(out, _mm_cvtsd_ss(_mm_set_ss(0.0f), _mm256_extractf128_pd(vout, 0))); ++out;
    }

    return 0;
}
```

The block is compiled with the following aggressive optimisation options:
```
clang++ -O3 -march=haswell -ffast-math
```

Compilation uses a stock clang and can be reproduced on godbolt.org.

When built with e.g. clang 12, the inner loop appears as:
```
.LBB2_4: # =>This Inner Loop Header: Depth=1
...
        vfmadd231pd     %ymm5, %ymm10, %ymm13   # ymm13 = (ymm10 * ymm5) + ymm13
        vfmadd231pd     %ymm1, %ymm12, %ymm13   # ymm13 = (ymm12 * ymm1) + ymm13
        vmovapd %ymm6, %ymm2
        vfmadd213pd     %ymm13, %ymm10, %ymm2   # ymm2 = (ymm10 * ymm2) + ymm13
        vfmadd231pd     %ymm12, %ymm7, %ymm2    # ymm2 = (ymm7 * ymm12) + ymm2
        vpermpd $147, %ymm2, %ymm11     # ymm11 = ymm2[3,0,1,2]
...
```
whereas with clang 9, it appears as:
```
.LBB2_4:                                # =>This Inner Loop Header: Depth=1
...
        vmulpd  %ymm9, %ymm11, %ymm2         # ymm2 = ymm11 * ymm9
        vfmadd231pd     %ymm8, %ymm12, %ymm2  # ymm2 = (ymm12 * ymm8) + ymm2
        vmulpd  %ymm10, %ymm11, %ymm5        # ymm5 = ymm10 * ymm11
        vaddpd  %ymm1, %ymm2, %ymm11         # ymm11 = ymm2 + ymm1
        vfmadd231pd     %ymm13, %ymm12, %ymm5 # ymm5 = (ymm12 * ymm13) + ymm5
        vaddpd  %ymm11, %ymm5, %ymm1         # ymm1 = ymm5 + ymm11
        vpermpd $147, %ymm1, %ymm2    # ymm2 = ymm1[3,0,1,2]
...
```

Note that, in the clang 12 version, dependencies between the `vfmadd` instructions prevent any `vfmadd` from starting before the previous one has completed. In the clang 9 version, on the other hand, there is room for parallelism: some of the `mul` instructions can execute in parallel with the `fmadd` instructions, which likely results in better loop throughput.
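A rough way to see the difference is to compare critical-path lengths. The sketch below uses latencies I assume for Haswell (5 cycles for `vmulpd` and `vfmadd*`, 3 for `vaddpd`), so treat the numbers as an estimate, not a measurement:

```cpp
// Back-of-the-envelope critical-path estimate. The latencies are an
// assumption (taken from published Haswell instruction tables):
// vmulpd = 5 cycles, vfmadd* = 5 cycles, vaddpd = 3 cycles.
constexpr int MUL = 5, FMA = 5, ADD = 3;

// clang 12 body: the four vfmadd instructions form one dependency chain,
// so each must wait for the previous result.
constexpr int chain_clang12 = 4 * FMA; // = 20 cycles per iteration

// clang 9 body: the two vmulpd results feed the chain from the side, so
// the critical path is roughly mul -> fmadd -> add -> add.
constexpr int chain_clang9 = MUL + FMA + ADD + ADD; // = 16 cycles per iteration

static_assert(chain_clang12 > chain_clang9,
              "the fully fused form has the longer dependency chain");
```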

In other words, clang 12 seems to systematically fuse the multiplication and the addition, even when it is not beneficial to do so.

Further analysis, using LLVM MCA traces, shows that LLVM itself acknowledges that the clang 12 generated loop is slower than the clang 9 generated loop. Compare for instance the MCA report for the clang 12 loop:

```
$ llvm-mca -mcpu=haswell -iterations=100 build/.../Fmadd_m256_avx2.cpp.s
[0] Code Region

Iterations:        100
Instructions:      3100
Total Cycles:      4820
Total uOps:        3700

Dispatch Width:    4
uOps Per Cycle:    0.77
IPC:               0.64
Block RThroughput: 11.0
```
vs the MCA report from the clang 9 loop:
```
$ llvm-mca -mcpu=haswell -iterations=100 build_clang9/.../Fmadd_m256_avx2.cpp.s
[0] Code Region

Iterations: 100
Instructions:      3300
Total Cycles:      2815
Total uOps:        3900

Dispatch Width:    4
uOps Per Cycle:    1.39
IPC:               1.17
Block RThroughput: 9.8
```

Perhaps coincidentally, the cycle counts reported by MCA closely match performance measurements taken on an Intel CPU compatible with the Haswell architecture:
```
clang 12: 4820 cycles, 0.453 ms
clang 9:  2815 cycles, 0.288 ms
```

I understand that `-ffast-math` implies `-ffp-contract=fast`.

It looks to me (though maybe I am wrong) that this is a regression: I would expect the aggressive FP contraction triggered by `-ffp-contract=fast` to occur _only_ when beneficial, as clang 9 and clang 10 appear to do.
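For context, contraction is not only a scheduling choice, it also changes rounding: a fused multiply-add rounds once, whereas separate mul/add round twice. A minimal sketch (function names are mine) showing the numerical difference that `fp contract` licenses:

```cpp
#include <cmath>

// Separate multiply and add: the product is rounded to double before the
// addition. The volatile intermediate forces that rounding even when the
// compiler would otherwise contract the expression.
double mul_then_add(double a, double b, double c) {
    volatile double p = a * b; // rounded here
    return p + c;
}

// Single rounding at the end, i.e. what a vfmadd instruction computes.
double fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Example: with a = 1 + 2^-27, b = 1 - 2^-27, c = -1, the exact product
// 1 - 2^-54 rounds to 1.0, so mul_then_add cancels to 0.0 while the
// fused form returns -2^-54.
```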

So far, I have not found a way to get clang 12 to behave like clang 9 on this loop.
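For reference, clang does expose a per-block pragma to control contraction; below is a minimal sketch of where it would sit (the function name is mine, and I have not checked whether this restores the clang 9 scheduling for the loop above, since turning contraction off entirely also loses the beneficial fusions):

```cpp
#include <cstddef>

// Hypothetical helper illustrating clang's fp pragma. Under
// `#pragma clang fp contract(off)` the front end keeps the multiply and
// the add as separate operations, even with -ffast-math on the command line.
double dot_nocontract(const double *a, const double *b, std::size_t n) {
#pragma clang fp contract(off)
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i]; // stays mul + add, never vfmadd
    return acc;
}
```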
</pre>