<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/58339>58339</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [X86] Redundant instructions when shuffling and returning lower 128-bit of 256-bit register
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          damageboy
      </td>
    </tr>
</table>

<pre>
    The following code ([Compiler-Explorer Link](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,selection:(endColumn:5,endLineNumber:5,positionColumn:5,positionLineNumber:5,selectionStartColumn:5,selectionStartLineNumber:5,startColumn:5,startLineNumber:5),source:'%23include+%22immintrin.h%22%0A%0Aextern+__m128i+not_so_great_for_clang(__m256i+u1,+__m256i+u2,+short+*p)+%7B%0A++++auto+packed+%3D+_mm256_packus_epi32(u1,+u2)%3B%0A++++packed+%3D+_mm256_permute4x64_epi64(packed,+0b11!'01!'10!'00)%3B%0A%0A++++return+_mm256_castsi256_si128(packed)%3B%0A%7D%0A'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:42.347908745247146,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:g111,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'0',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:c%2B%2B,libs:!((name:fmt,ver:trunk)),options:'-std%3Dc%2B%2B20+-march%3Dnative+-O2',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'x86-64+gcc+11.1+(C%2B%2B,+Editor+%231,+Compiler+%231)',t:'0')),k:47.97605072595937,l:'4',m:46.87493678757056,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1500,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:c%2B%2B,libs:!((name:fmt,ver:'811')),options:'-std%3Dc%2B%2B20+-march%3Dnative+-O2',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'x86-64+clang+15.0.0+(C%2B%2B,+Editor+%231,+Compiler+%232)',t:'0')),header:(),l:'4',m:53.12506321242945,n:'0',o:'',s:0,t:'0')),k:57.65209125475286,l:'3',n:'0',o:'',t:'0')),
l:'2',n:'0',o:'',t:'0')),version:4)):

```cpp
#include "immintrin.h"

extern __m128i not_so_great_for_clang(__m256i u1, __m256i u2, short *p) {
    auto packed = _mm256_packus_epi32(u1, u2);
    packed = _mm256_permute4x64_epi64(packed, 0b11'01'10'00);

    return _mm256_castsi256_si128(packed);
}
```
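To see which inputs actually reach the return value, the following portable sketch models the three intrinsics on plain arrays. It assumes the documented per-128-bit-lane behaviour of `vpackusdw` and the qword selection of `vpermq`. The struct and function names are hypothetical and exist only for illustration.

```cpp
// Portable scalar model of not_so_great_for_clang(). Registers are plain
// arrays; this mirrors the documented intrinsic semantics only.
#include "cassert"   // for assert() in usage checks
#include "cstdint"

struct I32x8  { int32_t  v[8];  };  // a __m256i viewed as 8 x int32
struct U16x16 { uint16_t v[16]; };  // a __m256i viewed as 16 x uint16
struct U16x8  { uint16_t v[8];  };  // a __m128i viewed as 8 x uint16

// Signed 32-bit -> unsigned 16-bit saturation, as performed by vpackusdw.
inline uint16_t sat_u16(int32_t x) {
    return uint16_t(x < 0 ? 0 : (x > 65535 ? 65535 : x));
}

// _mm256_packus_epi32(a, b) works per 128-bit lane:
// lane 0 = [a0..a3, b0..b3], lane 1 = [a4..a7, b4..b7].
inline U16x16 packus_epi32_256(const I32x8& a, const I32x8& b) {
    U16x16 r{};
    for (int lane = 0; lane != 2; ++lane)
        for (int i = 0; i != 4; ++i) {
            r.v[lane * 8 + i]     = sat_u16(a.v[lane * 4 + i]);
            r.v[lane * 8 + 4 + i] = sat_u16(b.v[lane * 4 + i]);
        }
    return r;
}

// _mm256_permute4x64_epi64(x, imm): destination qword q takes source
// qword (imm >> 2q) & 3; each qword holds 4 x uint16.
inline U16x16 permute4x64(const U16x16& x, int imm) {
    U16x16 r{};
    for (int q = 0; q != 4; ++q) {
        int src = (imm >> (2 * q)) & 3;
        for (int k = 0; k != 4; ++k) r.v[q * 4 + k] = x.v[src * 4 + k];
    }
    return r;
}

// Pack, permute with 0b11'01'10'00 (216), keep the low 128 bits (the cast).
inline U16x8 model_source(const I32x8& u1, const I32x8& u2) {
    U16x16 p = permute4x64(packus_epi32_256(u1, u2), 0b11'01'10'00);
    U16x8 r{};
    for (int i = 0; i != 8; ++i) r.v[i] = p.v[i];
    return r;
}

inline bool operator==(const U16x8& a, const U16x8& b) {
    for (int i = 0; i != 8; ++i)
        if (a.v[i] != b.v[i]) return false;
    return true;
}
```

Take `u1 = {1, -5, 70000, 42, 7, 8, 65535, -1}`. The model returns `{1, 0, 65535, 42, 7, 8, 65535, 0}` whatever `u2` holds. The permute moves `u2`'s packed qwords to positions 2 and 3, which is the upper half that the cast discards.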

With Clang 15.0.0 (`-std=c++20 -march=native -O2`), the pack and permute are combined into a generic shuffle, which is then not lowered efficiently:

```asm
        vpackusdw       ymm0, ymm0, ymm0
        vextracti128    xmm1, ymm0, 1
        vpunpcklqdq     xmm0, xmm0, xmm1        # xmm0 = xmm0[0],xmm1[0]
        vzeroupper
        ret
```
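Clang's longer sequence is still correct. The permute keeps only qwords 0 and 2 of the pack, so the returned half depends on `u1` alone, and Clang therefore packs `ymm0` with itself. The sketch below models the three instructions on plain arrays. `vzeroupper` is left out because it does not change `xmm0`, and the type and function names are hypothetical. The model also suggests that one `vextracti128` followed by a single 128-bit `vpackusdw` of `u1`'s low and high halves would compute the same value. What Clang emits for that form has not been checked here.

```cpp
// Scalar model of Clang 15's sequence for not_so_great_for_clang().
// Registers are plain arrays; vzeroupper is omitted (it leaves xmm0 intact).
#include "cassert"   // for assert() in usage checks
#include "cstdint"

struct I32x8  { int32_t  v[8];  };  // ymm0 on entry (u1) as 8 x int32
struct U16x16 { uint16_t v[16]; };  // a ymm register as 16 x uint16
struct U16x8  { uint16_t v[8];  };  // an xmm register as 8 x uint16

// Signed 32-bit -> unsigned 16-bit saturation, as performed by vpackusdw.
inline uint16_t sat_u16(int32_t x) {
    return uint16_t(x < 0 ? 0 : (x > 65535 ? 65535 : x));
}

inline U16x8 model_clang_sequence(const I32x8& u1) {
    // vpackusdw ymm0, ymm0, ymm0: both sources are u1, so each 128-bit lane
    // holds its packed qword twice.
    U16x16 ymm0{};
    for (int lane = 0; lane != 2; ++lane)
        for (int i = 0; i != 4; ++i) {
            uint16_t s = sat_u16(u1.v[lane * 4 + i]);
            ymm0.v[lane * 8 + i] = s;
            ymm0.v[lane * 8 + 4 + i] = s;
        }
    // vextracti128 xmm1, ymm0, 1: copy the upper 128-bit lane.
    U16x8 xmm1{};
    for (int i = 0; i != 8; ++i) xmm1.v[i] = ymm0.v[8 + i];
    // vpunpcklqdq xmm0, xmm0, xmm1: low qword of each source, xmm0 first.
    U16x8 xmm0{};
    for (int i = 0; i != 4; ++i) {
        xmm0.v[i] = ymm0.v[i];
        xmm0.v[4 + i] = xmm1.v[i];
    }
    return xmm0;
}

inline bool operator==(const U16x8& a, const U16x8& b) {
    for (int i = 0; i != 8; ++i)
        if (a.v[i] != b.v[i]) return false;
    return true;
}
```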

Compare this with GCC 11.1, which emits just the pack and the permute:

```asm
        vpackusdw       ymm0, ymm0, ymm1
        vpermq  ymm0, ymm0, 216
        ret
```
</pre>