<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/90748>90748</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:X86,
            performance
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          RKSimon
      </td>
    </tr>
</table>

<pre>
    For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.
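
For reference, the widen/multiply/pack approach can be sketched with SSE2 intrinsics roughly as follows. This is a minimal illustration only: the helper name `mul_epi8_sse2` is hypothetical, and the exact instruction sequence LLVM emits for this lowering may differ.
```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not an LLVM or Intel API): 16 x i8 multiply done by
// widening to i16 lanes, multiplying, and packing back to i8.
static inline __m128i mul_epi8_sse2(__m128i x, __m128i y) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi16(0x00FF);
    // Zero-extend the low and high 8 bytes of each operand to 16-bit lanes.
    __m128i xlo = _mm_unpacklo_epi8(x, zero);
    __m128i xhi = _mm_unpackhi_epi8(x, zero);
    __m128i ylo = _mm_unpacklo_epi8(y, zero);
    __m128i yhi = _mm_unpackhi_epi8(y, zero);
    // 16-bit multiply, then keep only the low 8 bits of every product so
    // that the unsigned-saturating pack below never actually saturates.
    __m128i lo = _mm_and_si128(_mm_mullo_epi16(xlo, ylo), mask);
    __m128i hi = _mm_and_si128(_mm_mullo_epi16(xhi, yhi), mask);
    return _mm_packus_epi16(lo, hi);
}
```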

But if we have the (v)pmaddubsw instruction, we can zero out the odd/even parts of the i8-pairs of one of the operands, perform the two pmaddubsw instructions and then shift+or the results back together, all with a single mask.
```cpp
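#include <tmmintrin.h> // SSSE3: _mm_maddubs_epi16

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
    __m128i m = _mm_set1_epi16(255);        // 0x00FF in every i16 lane
    __m128i ylo = _mm_and_si128(m, y);      // keep the even bytes of y
    __m128i yhi = _mm_andnot_si128(m, y);   // keep the odd bytes of y
    __m128i lo = _mm_maddubs_epi16(x, ylo); // even-byte products, one per i16 lane
    __m128i hi = _mm_maddubs_epi16(x, yhi); // odd-byte products, one per i16 lane
    lo = _mm_and_si128(lo, m);              // low 8 bits of the even products
    hi = _mm_slli_epi16(hi, 8);             // move the odd products into the odd bytes
    return _mm_or_si128(lo, hi);
}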
```
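The reason the saturating add inside pmaddubsw is harmless here can be checked with a small scalar model. This is only a sketch: `maddubs_lane`, `mul_pair_model` and `model_matches_wrapping_mul` are hypothetical names introduced for illustration. Because one byte of each pair is zeroed, each lane holds a single u8*s8 product in the range [-32640, 32385], which fits in i16, so the low 8 bits always match a wrapping i8 multiply.
```cpp
#include <cstdint>

// Scalar model of one 16-bit lane of (v)pmaddubsw: bytes of `a` are unsigned,
// bytes of `b` are signed, and the two products are added with signed saturation.
static int16_t maddubs_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
    int sum = int(a0) * static_cast<int8_t>(b0) + int(a1) * static_cast<int8_t>(b1);
    if (sum > 32767) sum = 32767;
    if (sum < -32768) sum = -32768;
    return static_cast<int16_t>(sum);
}

// Model of the masked trick on one i8 pair: (x0, x1) * (y0, y1).
// Returns the two result bytes packed as (r1 << 8) | r0.
static uint16_t mul_pair_model(uint8_t x0, uint8_t x1, uint8_t y0, uint8_t y1) {
    uint16_t lo = static_cast<uint16_t>(maddubs_lane(x0, x1, y0, 0)); // x0 * y0 only
    uint16_t hi = static_cast<uint16_t>(maddubs_lane(x0, x1, 0, y1)); // x1 * y1 only
    return static_cast<uint16_t>((lo & 0x00FF) | (hi << 8));
}

// Exhaustively compares both byte positions against a plain wrapping multiply.
static bool model_matches_wrapping_mul() {
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint16_t r = mul_pair_model(uint8_t(x), uint8_t(x), uint8_t(y), uint8_t(y));
            uint8_t expected = static_cast<uint8_t>(x * y);
            if ((r & 0xFF) != expected || (r >> 8) != expected) return false;
        }
    }
    return true;
}
```
Calling `model_matches_wrapping_mul()` covers all 65536 operand combinations for both the even and the odd byte position.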
```asm
  vmovaps .LCPI0_2(%rip), %xmm5
  vpand %xmm5, %xmm1, %xmm3
  vpandn %xmm1, %xmm5, %xmm4
  vpmaddubsw %xmm3, %xmm0, %xmm3
  vpmaddubsw %xmm4, %xmm0, %xmm4
  vpand %xmm5, %xmm3, %xmm3
  vpsllw $8, %xmm4, %xmm4
  vpor %xmm4, %xmm3, %xmm4
```
llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears to be more borderline. We could begin by trying this only for multiply-by-constants (and shl-by-constants?).
</pre>