<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href="https://github.com/llvm/llvm-project/issues/121823">121823</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            [X86] Deoptimization of shuffle intrinsics

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          SEt-t

      </td>

    </tr>

</table>

<pre>

Godbolt: https://godbolt.org/z/Knofd3WoM

A combination of vpshufb (with partial zeroing) and vpermd is deoptimized into vpshufb + vpermd + vpblendd. This adds an unnecessary instruction and wastes one ymm register on holding a zero vector.

Also, the loop code generated here is worse than what GCC and MSVC produce.
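A minimal scalar sketch of the pattern follows. It uses hypothetical inputs, because the Godbolt source is not reproduced here.

The functions vpshufb, vpermd and vpblendd below are simplified byte-list models of _mm256_shuffle_epi8, _mm256_permutevar8x32_epi32 and _mm256_blend_epi32. The ctrl and idx vectors are made up for illustration.

The sketch assumes the zeroed bytes cover whole dwords. Under that assumption, it checks that the two-instruction form (zeroing vpshufb, then vpermd) gives the same result as the three-instruction form: a non-zeroing vpshufb, then vpermd, then a vpblendd against an all-zero vector. In that case the extra blend and zero register add nothing.

```python
# Hypothetical scalar models of three AVX2 instructions. A 256-bit vector
# is represented as a Python list of 32 byte values (0..255).

def vpshufb(src, ctrl):
    """Model of _mm256_shuffle_epi8: byte shuffle within each 128-bit lane;
    a control byte with bit 7 set writes zero."""
    out = []
    for i, c in enumerate(ctrl):
        lane_base = (i // 16) * 16  # shuffles never cross 128-bit lanes
        out.append(0 if c >= 0x80 else src[lane_base + c % 16])
    return out

def vpermd(src, idx):
    """Model of _mm256_permutevar8x32_epi32: dword i of the result is
    dword (idx[i] mod 8) of src, so dwords may cross lanes."""
    out = []
    for k in idx:
        j = k % 8
        out.extend(src[4 * j: 4 * j + 4])
    return out

def vpblendd(a, b, imm8):
    """Model of _mm256_blend_epi32: dword i comes from b if bit i of imm8
    is set, otherwise from a."""
    out = []
    for i in range(8):
        pick = b if (imm8 >> i) % 2 else a
        out.extend(pick[4 * i: 4 * i + 4])
    return out

# Hypothetical inputs: reverse bytes 0..11 of each lane, zero the last
# dword of each lane, then pack the non-zero dwords together.
src = list(range(32))
ctrl = [0x80 if i % 16 >= 12 else 11 - i % 16 for i in range(32)]
idx = [0, 1, 2, 4, 5, 6, 3, 7]

# Two-instruction form: the zeroing is done by vpshufb itself.
two_op = vpermd(vpshufb(src, ctrl), idx)

# Three-instruction form: same shuffle without zeroing, then a blend
# with an all-zero register for the two dwords that must be zero.
ctrl_nz = [c % 16 for c in ctrl]
three_op = vpblendd(vpermd(vpshufb(src, ctrl_nz), idx), [0] * 32, 0b11000000)

assert two_op == three_op
```

The model only covers the dword-aligned case. If a control vector zeroes bytes that do not form whole dwords, a vpblendd alone could not express that zeroing.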

</pre>
