<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/72530>72530</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [x86] Work around slow compress store instruction on znver4
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          farnoy
      </td>
    </tr>
</table>

<pre>
    Zen 4 has a slow, microcoded implementation of `V{,P}COMPRESS*` when the destination is a memory operand. The register-to-register forms of compress are not affected, and neither are any of the expand load instructions.

I'd like to contribute this optimization but I don't have experience with LLVM and could use some pointers.

For the repro case, I've created an example on [Compiler Explorer](https://godbolt.org/z/9nTvrj6Tq).

After the fix, I would expect the output of `-mcpu=znver4` to use this instruction sequence instead of a compress store:
1. compress into a vector register
2. `kmov` to move the mask into a GPR
3. `pext` to compress the mask
4. `kmov` to move the compressed mask back to a mask register
5. masked store
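
The steps above can be modelled in scalar code to check that they write the same bytes as a compress store. This is only a sketch under stated assumptions, not the proposed lowering: `soft_pext`, `compress_store_ref`, `compress_then_masked_store` and `LANES` are hypothetical names made up for illustration, the lane count of 8 is assumed, and step 3 is assumed to be `pext(mask, mask)`, which packs the set bits of the mask into the low positions.

```python
LANES = 8  # assumption: 8 lanes, e.g. 8 x f64 in a 512-bit vector


def soft_pext(src: int, mask: int) -> int:
    """Software model of BMI2 PEXT: pack the bits of src selected by mask into the low bits."""
    out = 0
    k = 0
    for i in range(32):
        if mask & (1 << i):
            if src & (1 << i):
                out |= 1 << k
            k += 1
    return out


def compress_store_ref(dst, src, mask):
    """Reference semantics of a compress store: selected lanes are written contiguously."""
    k = 0
    for i in range(LANES):
        if mask & (1 << i):
            dst[k] = src[i]
            k += 1
    return k


def compress_then_masked_store(dst, src, mask):
    """Model of the proposed sequence: compress in a register, then do a masked store."""
    # Step 1: register-to-register compress (unused upper lanes are zeroed here).
    reg = [0.0] * LANES
    k = 0
    for i in range(LANES):
        if mask & (1 << i):
            reg[k] = src[i]
            k += 1
    # Steps 2-4: kmov to a GPR, pext(mask, mask), kmov back to a mask register.
    store_mask = soft_pext(mask, mask)
    # Step 5: masked store; lanes outside store_mask leave memory untouched.
    written = 0
    for i in range(LANES):
        if store_mask & (1 << i):
            dst[i] = reg[i]
            written += 1
    return written
```

In this model `soft_pext(mask, mask)` always has exactly `popcount(mask)` low bits set, so the masked store touches the same leading lanes that the compress store would have written and leaves the rest of memory alone.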

References:
- https://www.uops.info/html-instr/VCOMPRESSPD_M512_ZMM.html#ZEN
- https://www.mersenneforum.org/showthread.php?p=614191
</pre>