<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/82813>82813</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            clang: _mm512_reduce_add_ps lowers to LLVM IR that does not reflect correct reduce order
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            clang
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          RalfJung
      </td>
    </tr>
</table>

<pre>
    This
```C
#include <immintrin.h>
float foo(__m512 x) {
    return _mm512_reduce_add_ps(x);
}
```
[produces](https://godbolt.org/z/qera4378s)
```
define dso_local noundef float @foo(float vector[16])(<16 x float> noundef %x) local_unnamed_addr #0 {
entry:
  %0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
  ret float %0
}
```
According to the [LangRef](https://llvm.org/docs/LangRef.html#fast-math-flags), the `reassoc` here means that the addition may happen in *any* order, which is not what [Intel documents](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=133&ig_expand=5303) -- they specify a particular, "tree-like" order.
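
To illustrate why the order matters, here is a minimal scalar sketch in plain C (no AVX-512 required). The helper names `reduce_add_tree16`, `reduce_add_forward16` and `reduce_add_backward16` are hypothetical, and the tree order is my reading of the pseudocode in the Intel guide (lane `j` is added to lane `j + len/2`, halving `len` each step), so treat it as an assumption rather than a reference implementation.
```C
#include <stddef.h>

/* Tree-like order: at each step, lane j is added to lane j + len
   (assumed to match the REDUCE_ADD pseudocode in the Intel guide). */
float reduce_add_tree16(const float x[16]) {
    float t[16];
    for (size_t i = 0; i < 16; i++)
        t[i] = x[i];
    for (size_t len = 8; len >= 1; len /= 2)
        for (size_t j = 0; j < len; j++)
            t[j] = t[j] + t[j + len];
    return t[0];
}

/* One order that `reassoc` would permit: strictly left to right. */
float reduce_add_forward16(const float x[16]) {
    float acc = 0.0f;
    for (size_t i = 0; i < 16; i++)
        acc += x[i];
    return acc;
}

/* Another permitted order: strictly right to left. */
float reduce_add_backward16(const float x[16]) {
    float acc = 0.0f;
    for (size_t i = 16; i-- > 0;)
        acc += x[i];
    return acc;
}
```
For the input `{1e20f, 1.0f, -1e20f, 1.0f, 0, ..., 0}`, the tree order pairs `1e20f` with `-1e20f` first and returns `2.0f`, the left-to-right order absorbs the first `1.0f` into `1e20f` and returns `1.0f`, and the right-to-left order absorbs both and returns `0.0f`. All three are valid outcomes under `reassoc`, but only the first matches the documented order.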

Even worse, we can chain two of these operations:
```C
#include <immintrin.h>
float foo(__m512 x) {
    float xr = _mm512_reduce_add_ps(x);
    __m512 y = _mm512_set_ps(
        xr, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0,
        0.0, 1.8, 9.3, 0.0, 2.5, 0.0, 6.7, 9.0
    );
    return _mm512_reduce_add_ps(y);
}
```
Now the second addition may be arbitrarily re-associated with the first one. As far as I understand, there's nothing about `reassoc` that constrains the re-association to only happen "inside" a single operation (and indeed, as a fast-math flag it is explicitly intended to apply when multiple subsequent operations are all `reassoc`).

Either `_mm512_reduce_add_ps` should be lowered to a vendor-specific intrinsic, or LLVM IR needs a version of `vector.reduce.fadd` that explicitly specifies the "tree-like" reduction order documented by Intel.
</pre>