<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/131588>131588</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            x86 avx2 vpor is first done on calculation-heavy operands
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          ImpleLee
      </td>
    </tr>
</table>

<pre>
    See the code and the compilation result at https://godbolt.org/z/Kchh341vW . The loop ORs several operands together (compiled to `vpor`); some of these operands are cheap to compute, while others are not. Compilation flags: `-O3 -std=c++2b -march=skylake`.

```c++
#include <experimental/simd>
#include <cstdint>
namespace stdx = std::experimental;

template <class T, std::size_t N>
using simd_of = stdx::simd<T, stdx::simd_abi::deduce_t<T, N>>;

using data_t = simd_of<std::uint64_t, 4>;

data_t f(data_t a, data_t b) {
    while (true) {
        data_t result = a;
        result |= (a << 1) & std::uint64_t(0x802008020080200);
        result |= a >> 1;
        result |= a >> 10;
        data_t temp = a << 50;
        result |= data_t([=](auto i) {
            if constexpr (i + 1 >= 4) return 0;
            else return temp[i + 1];
        });
        result &= b;
        if (all_of((result & ~a) == 0)) return a;
        a = result;
    }
}
```

The assembly of the loop is as follows.
```asm
.LBB0_1:
        vmovdqa %ymm4, %ymm3
        vpaddq  %ymm4, %ymm4, %ymm4
        vpand   %ymm1, %ymm4, %ymm4
        vpsrlq  $1, %ymm3, %ymm5
        vpsrlq  $10, %ymm3, %ymm6
        vpor    %ymm6, %ymm5, %ymm5
        vpsllq  $50, %ymm3, %ymm6
        vpermq  $249, %ymm6, %ymm6 # latency 3 on skylake
        vpblendd        $192, %ymm2, %ymm6, %ymm6
        vpor    %ymm6, %ymm5, %ymm5 # ymm6 is heavy to calculate, but or'ed first
        vpor    %ymm3, %ymm5, %ymm5 # ymm3 and ymm4 are cheap to calculate, but or'ed later
        vpor    %ymm4, %ymm5, %ymm4
        vpand   %ymm0, %ymm4, %ymm4
        vptest  %ymm4, %ymm3
        jae     .LBB0_1
```

The critical path of this loop is `vmovdqa -> vpsllq $50 -> vpermq -> vpblendd -> vpor -> vpor -> vpor -> vpand`, but if ymm6 were OR'ed last, the other two `vpor`s would not need to be on the critical path.
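
To make the cost of the ordering concrete, here is a rough latency model of the OR chain. It is only a sketch under assumed latencies, not a measurement: every instruction is taken to cost 1 cycle except `vpermq` (3 cycles, as noted in the assembly comment), register moves are assumed free, and throughput and port pressure are ignored. The names `or_chain_done`, `emitted_order` and `heavy_last_order` are made up for this illustration.

```c++
#include <algorithm>
#include <initializer_list>

// Assumed latencies in cycles (illustrative, not measured).
constexpr int kSimple = 1;   // vpaddq, vpand, vpor, vpsrlq, vpsllq, vpblendd
constexpr int kPermute = 3;  // vpermq

// Cycle at which each OR operand is ready, counted from the moment the
// loop-carried value `a` becomes available.
constexpr int kA = 0;                                     // a itself
constexpr int kShl1Masked = 2 * kSimple;                  // vpaddq + vpand
constexpr int kShr1 = kSimple;                            // vpsrlq $1
constexpr int kShr10 = kSimple;                           // vpsrlq $10
constexpr int kLaneShift = kSimple + kPermute + kSimple;  // vpsllq + vpermq + vpblendd

// Cycle at which a left-to-right chain of ORs finishes, given the ready
// cycle of each operand in the order they are OR'ed.
constexpr int or_chain_done(std::initializer_list<int> ready) {
    auto it = ready.begin();
    int done = *it++;
    for (; it != ready.end(); ++it) {
        done = std::max(done, *it) + kSimple;  // one vpor per extra operand
    }
    return done;
}

// Order in the assembly above: the expensive operand is OR'ed second.
// The trailing kSimple is the final vpand with b.
constexpr int emitted_order() {
    return or_chain_done({kShr1, kShr10, kLaneShift, kA, kShl1Masked}) + kSimple;
}

// Suggested order: the expensive operand is OR'ed last.
constexpr int heavy_last_order() {
    return or_chain_done({kShr1, kShr10, kA, kShl1Masked, kLaneShift}) + kSimple;
}
```

Under these assumptions `emitted_order()` evaluates to 9 cycles per iteration and `heavy_last_order()` to 7, which matches the critical path described above: two of the three dependent `vpor`s move off the loop-carried chain.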
</pre>