<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/90748>90748</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:X86,
performance
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
RKSimon
</td>
</tr>
</table>
<pre>
For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.
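But if the (v)pmaddubsw instruction is available, we can zero out the odd/even bytes of each i8 pair in one of the operands, perform two pmaddubsw operations, and then shift+or the results back together, all with a single mask. With one byte of each pair zeroed, each i16 lane holds a single product, so the saturating pairwise add cannot overflow, and only the low 8 bits of each product are kept.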
```cpp
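// Requires SSSE3 for pmaddubsw; the intrinsics are declared in tmmintrin.h.
__m128i _mm_mul_epi8(__m128i x, __m128i y) {
  __m128i m = _mm_set1_epi16(255);         // 0x00FF in every i16 lane
  __m128i ylo = _mm_and_si128(m, y);       // keep the even bytes of y
  __m128i yhi = _mm_andnot_si128(m, y);    // keep the odd bytes of y
  __m128i lo = _mm_maddubs_epi16(x, ylo);  // x[even] * y[even] per i16 lane
  __m128i hi = _mm_maddubs_epi16(x, yhi);  // x[odd] * y[odd] per i16 lane
  lo = _mm_and_si128(lo, m);               // low byte of the even products
  hi = _mm_slli_epi16(hi, 8);              // low byte of the odd products, moved to the odd byte
  return _mm_or_si128(lo, hi);
}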
```
```asm
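vmovaps .LCPI0_2(%rip), %xmm2     # m = 0x00FF in every i16 lane
vpand %xmm2, %xmm1, %xmm3         # ylo = y & m
vpandn %xmm1, %xmm2, %xmm4        # yhi = ~m & y
vpmaddubsw %xmm3, %xmm0, %xmm3    # lo = x[even] * y[even]
vpmaddubsw %xmm4, %xmm0, %xmm4    # hi = x[odd] * y[odd]
vpand %xmm2, %xmm3, %xmm3         # lo &= m
vpsllw $8, %xmm4, %xmm4           # hi <<= 8
vpor %xmm4, %xmm3, %xmm0          # result = lo | hi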
```
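The arithmetic behind the sequence above can be sanity-checked with a scalar model of a single i16 lane. This is only a sketch: the helper names (maddubs_lane, mul_i8_pair_via_maddubs, matches_truncating_multiply) are hypothetical and not part of LLVM or any intrinsics header. It assumes the documented pmaddubsw semantics (first operand bytes unsigned, second operand bytes signed, pairwise add with signed 16-bit saturation) and checks that the low 8 bits of every product survive.
```cpp
// Interpret a byte as a signed 8-bit value without relying on narrowing casts.
int as_i8(unsigned char v) { return v >= 128 ? (int)v - 256 : (int)v; }

// Hypothetical scalar model of one i16 lane of (v)pmaddubsw: the a bytes are
// unsigned, the b bytes are signed, and the sum saturates to signed 16 bits.
int maddubs_lane(unsigned char a0, unsigned char a1,
                 unsigned char b0, unsigned char b1) {
    int sum = (int)a0 * as_i8(b0) + (int)a1 * as_i8(b1);
    if (sum > 32767) sum = 32767;
    if (sum < -32768) sum = -32768;
    return sum;
}

// Hypothetical model of one i16 lane of the proposed sequence. x0/y0 are the
// even (low) bytes of the lane, x1/y1 the odd (high) bytes.
unsigned mul_i8_pair_via_maddubs(unsigned char x0, unsigned char x1,
                                 unsigned char y0, unsigned char y1) {
    int lo = maddubs_lane(x0, x1, y0, 0);  // ylo: odd byte of y zeroed
    int hi = maddubs_lane(x0, x1, 0, y1);  // yhi: even byte of y zeroed
    unsigned lo_bits = (unsigned)lo & 0x00FFu;         // lo & m
    unsigned hi_bits = ((unsigned)hi << 8) & 0xFFFFu;  // 16-bit shift left by 8
    return lo_bits | hi_bits;
}

// Returns true if both bytes of the lane equal the truncating 8-bit product
// for every combination of byte values.
bool matches_truncating_multiply() {
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            int x1 = 255 - x, y1 = 255 - y;  // odd bytes also cover 0..255
            unsigned r = mul_i8_pair_via_maddubs((unsigned char)x, (unsigned char)x1,
                                                 (unsigned char)y, (unsigned char)y1);
            if ((r & 0xFFu) != ((unsigned)(x * y) & 0xFFu)) return false;
            if ((r >> 8) != ((unsigned)(x1 * y1) & 0xFFu)) return false;
        }
    }
    return true;
}
```
The saturation branches in maddubs_lane never fire here, since the largest magnitude of a single unsigned-by-signed byte product is 255 * 128 = 32640. Treating y as signed only changes the upper bits of the i16 product, which the mask and shift discard.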
llvm-mca analysis: https://llvm.godbolt.org/z/9361GKrds. Most CPUs benefit from this, but SandyBridge appears to be more borderline. We could begin by trying this only for multiply-by-constants (and shl-by-constants?).
</pre>