<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/137422>137422</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [X86] Suboptimal code for AVX-512 narrowing
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          dzaima
      </td>
    </tr>
</table>

<pre>
    This code, compiled via `-O3 -march=znver4`:

```c
#include <immintrin.h>
#include <stdint.h>

void narrow_u32x16x4_to_u8x64(uint8_t* dst, __m512i x0, __m512i x1, __m512i x2, __m512i x3) {
    __m512i inds = _mm512_set_epi8(
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0,
        124, 120, 116, 112, 108, 104, 100, 96,
         92,  88,  84,  80,  76,  72,  68, 64,
         60,  56,  52,  48,  44,  40,  36, 32,
         28,  24,  20,  16,  12,   8,   4,  0
    );
    __m512i x01 = _mm512_permutex2var_epi8(x0, inds, x1);
    __m512i x23 = _mm512_permutex2var_epi8(x2, inds, x3);
    __m512i x0123 = _mm512_mask_blend_epi64(0xF0, x01, x23);
    _mm512_storeu_si512(dst, x0123);
}
```
produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, zmmword ptr [rip + .LCPI0_0]
        vmovdqa64       zmm5, zmmword ptr [rip + .LCPI0_1]
        vpshufb zmm1, zmm1, zmm4
        vpshufb zmm0, zmm0, zmm5
        vpshufb zmm3, zmm3, zmm4
        vpshufb zmm2, zmm2, zmm5
        vporq   zmm0, zmm0, zmm1
        vporq   zmm1, zmm2, zmm3
        vpmovsxbd       zmm3, xmmword ptr [rip + .LCPI0_3]
        vpermi2d        zmm3, zmm0, zmm1
        vmovdqu64       zmmword ptr [rdi], zmm3
        vzeroupper
        ret
```
instead of the more direct version that gcc produces:
```asm
narrow_u32x16x4_to_u8x64:
        vmovdqa64       zmm4, ZMMWORD PTR .LC0[rip]
        kmovb   k1, BYTE PTR .LC1[rip]
        vpermt2b        zmm0, zmm4, zmm1
        vpermi2b        zmm4, zmm2, zmm3
        vmovdqa64       zmm0{k1}, zmm4
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        ret
```

The code implements a general 64-element `u32` to `u8` narrow. It should have 2x higher throughput than the `vpmovdb` sequence that clang's autovectorizer currently emits, on both Intel and AMD. It also allows merging multiple results via a blend instead of an insert, and the blend can run on more ports. Using this approach in autovectorization is perhaps a separate thing that could be improved. I believe similar approaches should get a ~2x throughput boost for all narrowing conversions, on both Intel and AMD.

https://godbolt.org/z/ax7Yda7Ps
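
For reference, the intended semantics of the narrow can be written as a scalar loop. This is a minimal sketch, assuming `x0`..`x3` hold 64 consecutive `u32` values in memory order (`x0` lowest); the name `narrow_u32x64_to_u8x64_ref` and the `src` pointer are hypothetical and not part of the code above:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar reference (hypothetical helper, for illustration only):
// truncates 64 consecutive u32 values to their low 8 bits.
void narrow_u32x64_to_u8x64_ref(uint8_t* dst, const uint32_t* src) {
    for (size_t i = 0; i < 64; i++) {
        dst[i] = (uint8_t)src[i]; // keep only the low byte of each lane
    }
}
```

Both the permute-based version and a `vpmovdb`-based one should match this loop byte for byte, which makes it usable as a check when comparing codegen.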
</pre>