<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/116931>116931</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
vpermb (_mm256_permutexvar_epi8) byte transpose compiles to multiple XMM shuffles if the result is stored
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
pcordes
</td>
</tr>
</table>
<pre>
For some patterns of shuffle constant, we miss compiling `_mm256_permutexvar_epi8` to `vpermb ymm` if the result is only stored, not used in ways that require it as a single 256-bit vector. The worse version is 5 to 6 XMM shuffle instructions, so it's worse even on Zen 1 or a future Intel E-core with AVX10.
This happens even when inlining into a loop and unrolling.
Present in all Clang versions as far back as the first one to support AVX-512VBMI, and in current trunk ([Godbolt](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1AB9U8lJL6yAngGVG6AMKpaAVxYMQAdlIOAMngMmABy7gBGmMTepAAOqAqEtgzObh7RcQk2AgFBoSwRUT6WmNZJQgRMxAQp7p5FmFZZDOWVBDkh4ZHRChVVNWlFvW2BHfldXgCUFqiuxMjsHACkAEwAzIHIblgA1IurjngsLIEExIEAdAh72IsaAIIr6wybrjt7jj34gpfXtw/3piYBFOeDCrgImFMUAYqECtBGEwmfwAbrD0NsFAhXFQeiRMBBUXh0QAqGYEUjbUwsZYAVgAbHhtsikV4AEJ/bac7YAem5TAULAgK2WnJhMV2a22MTwyAA1ph0SsaRphdsQBLlitWaphQTEXt2fcuTzucjdqsACKUo60unGJjodDGTDSgAcBNIzINHK5vO2LHBQMCwCZ2wi/GImG2BAQUcx2Ko9Ax8OACAItAAnttkAhDMBMApo7HtoEFAxswhYfMKSwmLLgyWCBiM0dMCDkGG2wB3TCMbYIPCp7m0VBdimliEO7aoKhMmKRFgAR2jqDDxFQDtEPSLUYHqe2i67JHQPs5PSDHbQDG3rhOq2WxibYQzrs5eytGlUXFdXA0Gldf4Gpyfojl2YYZhC06zpgTA5geR7ECeRpcueNiXgIN53g%2BTYCAW5ofqoGi/sRRHeshnJUrajLPhC8azu%2B1rUvSxhKKczrSveEDPq%2BWrTkECjEhoFLcRKrJ8QWxJcMJL6ieJAnLNJPHLGJuECaspCnsaWnaTpul6caIm8apxLSOBSkqfxxI0opsnGXSNlGZZXjbEiqyGncfraZR9LURBBZYvRlqMbaLFtsQ7F4Jxn7fr%2B/5/hS%2BlctFP5/gBKrKZ%2BRFZcRCXbMlsVpVqmUkcRxIKXlX4pXF6WssV2VEcS6l/J5iWtW12n5algEZYRJUNaZnXVUVvX1VwVkUoNhU9aNDX2RVMVdTVdUlcSzmue5mlmgxxg2sx87EAGEKqMilQRe6NH%2BdiFJem5mk7Uxdq4pGLF4LaEBktd61/IsXgWt9AL2sCZxghCUIQDCcIIki9yEuidFPZCyLLASaLbKS4IUt5DJMiy7nGqhMrZhhTa3oI96PmZb5BZNgFucB3LbKB4GQTO2wwXBh7HppBPodeJNYRTqn4RVM1cGRdxaVjvm0QFwv3SFrHhS6kXI4ZylyYJDnq8ZUmU45EnlWrFkSU15Hteb%2BlGxrplW8Z1l69rllzbbTkueLxpbUF8t7Qu4KYMdp3K%2Bdfl0Z97tcvygqqtsYoSqsUoyvKiq0ulIrqsKWo6ssyPMutvoM5Gi63pGO5EywMT0JByIlEQxDbLaAC0YSEEyNckCubOqHg26BNsTAYsGSaRsA3cQsQp53btj215CCRvR9OPiz9f33P9DxrBsWxRu8ny2D8qw3KvG/PFv5ofCCRj74fdw5pUYbYosNKsvej9/bd9y33XZJi0/L80m/7lP7TnBJqX%2BoCAF/BOP6JggQIATF2GyTSEZtgQCgXgd8GhzRiXQfse8WCSxai1LjTSBkH5P3Qf/YWjJeJgA4HcWh4c3y/TujtKioZtpT2MCOB0rgXpvQgFebcUtiQTDBFQL6Zt4Yz3euCXWN08YoQCgjYwSMZEEHKvIzSeBZwQDYCwZA5c1G6zJOVe88CwBgAYhoYhZtjQxDOIIKgEANT4CoFQSIjB5ggEfo4BgGdbGJWFN/FByBiBMAAF5ZjoomAs8CGIMNpOce8CgGE0l8f4iW5sgkgJQbBAgrgxBzgXGEeBwsEk0iScsFJSw0l%2BOzhSb%2BDSQESMydpSM%2BTiDljFu/VpCCV4PF%2BhwKYtBOA0l4J4DgWhSCoE4I4DEMw5jbzWDwUgBBNBDKmLGB0XQ4GkFlCAGkXBzhcGzssDQkhVgaDpHSAAnK6OkqxpAjI4JIXgLAJDxQmVMmZHBeAKBAEJNZkyhmkDgLAGAiAUCoHLnQSI5BKBoBhfQKIyBNh5kfMQVwDBZR8DoGPf5XF1mkGbswYgGZOArJJZUDMAB5MI2ga4Ut4IitgggaUMEzESrAYJgCODELQf53BeBYFrEYcQwLSD4EjKUaugqpn%2BxKH7Jl5BBANGVfCMIYSyXOCwESkE7yhWekiGEeImALSYFFcAeERh1lTETEwYACgABqeBMBdhpfOCZKz%2BCCBEGIdgUgZCCEUCodQErdAKQMDa0wxhzAav%2BZAKYqAYhNEFQ3WscwriWhlA0OsmAG6bFdYIbYDcaXxwbgAdX5bwVA1diBnCwAm3ZxRSh2AgA4fongpL%2BBGHkAoegMiJAEJ2gd8Qh0MHaH2roUkW1NBaH0FwtQ9CzrKEMSdnQogzqGCOrdrR11jE3VMBQCyvFcGGaM8ZRKfnbFUPchudJJDZijSGCApwsWyngRAXAhB26PDPbwIFWhER7JAKsI5NyvAnK8MsSQrplheBpF4LwNz9CcFeaQd5dJljnBuTSVYqw6ReFdAhh9XhCOkC%2BTWzgfyAWrNtaCiFEAkBkhiBjCgAjoXSmRcEVgCxb10nvY%2BtFRgUFvuxRMXgCof0Nr0D64QohxCBrkyGtQRKI2kC7GEmITLz0cDGRRq9nAaXglYzhWc/HBNPrzKJzF4mUHOCRZEOOywJN0eBcBrZWAoi7Oeeh95qwvDnHOdc%2B8JyLk0jWOpSj0zqMWFo4BjZIHYPnGI0h3Dtz8NEVWK6VDHBViXolT8gDtrdPLAK982LCXgN1oSHYSQQA%3D%3D%3D))
```
__attribute__((noinline))
void shufstore_v2(void *out, __m256i v){
static const uint32_t by8 = 0x18100800; // low byte of each qword
static const uint32_t ones = 0x01010101; // later dwords get the second, etc. byte of each src qword
__m256i byteshuf = _mm256_setr_epi32(by8 + ones*0, by8 + ones*1, by8 + ones*2, by8 + ones*3,
by8 + ones*4, by8 + ones*5, by8 + ones*6, by8 + ones*7 );
v = _mm256_permutexvar_epi8(byteshuf, v);
asm(" nop # picked %0" : "+x"(v)); // require the complete vector 256-bit vector to exist in a single register
_mm256_store_si256(out, v);
}
```
This compiles as expected,
```
shufstore_v2:
vmovdqa .LCPI1_0(%rip), %ymm1
vpermb %ymm0, %ymm1, %ymm0
nop # picked %ymm0
vmovaps %ymm0, (%rdi)
vzeroupper
retq
```
But without the inline asm blackbox between the shuffle and store, Clang spends 5 or 6 shuffle uops to feed two 128-bit stores. (This is obviously *much* less efficient; `vpermb` is single-uop on every CPU that supports it. At worst 6c latency on Zen 4 for the 512-bit version, but this is the 256-bit version so 4c latency there.)
```
shufstore:
vpshufb .LCPI0_0(%rip), %xmm0, %xmm1
vextracti128 $1, %ymm0, %xmm2
vpermq $255, %ymm0, %ymm0
vpunpcklbw %xmm0, %xmm2, %xmm0
vpunpcklwd %xmm0, %xmm1, %xmm2
vpunpckhwd %xmm0, %xmm1, %xmm0
vmovdqa %xmm0, 16(%rdi)
vmovdqa %xmm2, (%rdi)
vzeroupper
retq
```
Or if `v` is mutated before the shuffle+store, e.g. with `v = _mm256_add_epi8(v,v);`, the shuffle choice becomes symmetric between low half and extracted high half, instead of using `vpermq $0xFF` to broadcast the high qword.
```
shufstore_mutated:
vpaddb %ymm0, %ymm0, %ymm0
vextracti128 $1, %ymm0, %xmm1
vmovdqa .LCPI0_0(%rip), %xmm2
vpshufb %xmm2, %xmm0, %xmm0
vpshufb %xmm2, %xmm1, %xmm1
vpunpcklwd %xmm1, %xmm0, %xmm2
vpunpckhwd %xmm1, %xmm0, %xmm0
vmovdqa %xmm0, 16(%rdi)
vmovdqa %xmm2, (%rdi)
vzeroupper
retq
```
There's no correctness problem, just performance; I tested with memcmp in a test `main` in the Godbolt link.
The extra instructions take more space than the 16 bytes saved by using a narrower shuffle-control vector. (18 bytes of code to load a shuffle control vector, `vpermb`, and a single `vmovdqa` store. Plus 32B constant is 50 bytes of static size). vs. the version with `vpermq` being 42B of code + 16B of data = 58B, the other is 1 byte smaller (not counting the extra vpaddb). So not appropriate even for `-Oz`.
(The best I was able to do by hand was 53 bytes, using `mov $4, %al` ; `vpbroadcastb %eax, %xmm2` ; `vpermq $255, %ymm0, %ymm0` to get something to add to the high lane of a broadcasted 16-byte vector to generate an input for vpermb. `push $4` ; `vpbroadcastb (%rsp), %xmm2` is also 8 bytes if restoring RSP is free. xor-zero + `vgf2p8affineqb $0x04, zero,zero, %dst` is 10 bytes.)
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzEeVuXorrz9qdhbrKmFwTxcDEX4SgqKuIJb2ZxCAcFgiQI-unfBdo9PTO992_fvOvf3auVkFSqKs9TSVU8StO4wPgHJ8mcpH7zapaQ6kcZkCrE9JtPwvsPnVSAkhyD0mMMVwUFJAI0qaMowyAgBWVewTiogAaDPKUUBCQv0ywtYsAN-Z95DqXhzxJXec1we_Oqn7hMx9yQB4x0HW7dKx_c87xrSyPAEgwqTOuMgZQCUmR3QBmpcNjNURAGaopDkBag8e4UsMRjoMLXOq0wSBnwKPAATYs4wwBKw-9-ysANB4xUbwBsEwwaUlEMbriiKSm6GaROkSE4WtaHVWlBWVUHLCUF7WalBKSMgyP6Go1vuACkACdcAAGQCnggqlldYWAWDGdA-x6QCoMmZQlA-6PAv3G8yvHo-X-bpBQkXlnigj5FNQkuQFpkadF5LS0YAR7ICCmBV4SgLiqSdf78Tcq6whQXrHOEl2VAybwifjeLdm6IvAr4XnDpvnc-jdKKMkAK3NlL67IkFeu0-y4JcC9bZmdoN11agKCuqk42q-riAjg45iTZIKFPMsZJKgfHCWMl5UTEQZ2Devx89UaquG8QH5yIVu4iTpAuX0PJVlrkNmtXtpRc3oTyAumCrQTIW2sKsh6NzCw0QnajM8uW72gTT-bMdjV0p6nP20ektGNZRnMiF1uEdmhaNsi6Id3dOmwWC0ie7MbZbDG8oyI29sYQWfPSQ_vWtWwUZuYKWUVs5Ui-j2LZyK0tLhFaXdFVY_Fj5SejTWBfIIp1mTjNZrcdHvLlaWbHG6tFdjkqJT3XTyd1ddjL6iUZnKYbJdnvl4dMvx2gPI2y8BgrO_2a1q11plOkXJDWoIc5Nf0sRoJZV-cipgtT01pTQyFKRtCkHjLNatjc_erMRuezOIjL6MgaDupimbqyvsKqUsVmrls75F41hcmG1uRRg_xK5ZdUR8lRt3G6sWR5d0x4G10NV9ud_V1DTy7ax8ifJoyml_0AaTMO6r4XjFyEc2mLdgsUz-GhmCXWHnoQWtvmjpBQJrxDvU2ZhNSWF1nKDl6Z2FvtyGAU1NtHHZzDK0XKfEeGfF0YszMJVWOrOjEK5KX5MJfTuXRbwcVUti2lUU6wSTmoG2ZuQBnZu6CFczJB7XiFFGQyhArGLihRLdlCJRlNGdF01NImcSNr7jT5YhHfD4qcWnxoKerFMGAji1vF8l1zrZSjnN_LYZpnqbYaeWSeWPl8o8doA89XVW11W2Xa2lns3OkVw3Q4mk3tNZXWjjr1bdUQq2R1hYc29V37UVEJ35mR7Y56eES8kYXRwCjvETln0HVPicIPH2W8RVKMNyNNwZsyqPEyPcbmUhJjDkpQ3roKOkgkuhKjs5luNlNMk2K2q7zzYp0oY-9R9x2Fandz2mQ-Dx6Vc8Oaur727Ydqe9GUs5YQfRFs5tVsbB_aWWDNFsMUz1CxWM7W04UxqzXFo6VSUO9QeNuyzoaBZ-Zjr2wXzkp2LteobU2elLQwjs4yM06n49nX0uvdKIKo8k6byWK3keWTe9tkV8v3Fjqj9kgfzNpiFFW9KhzUJQ7qidM_ZQHT191zLM-r-alv2-inYKMcfSuj-3J_zXeXYNOa82N2HJRHfXigdA_3WuAMMw7qMyk9sIMAC8nLYu3kbma7pXcqjvvd_nYU9s3-siPLZDfxluoRbvbWPtzuw10WOI-8xlKe5XHbxpC245GGDG1-tdLM3uDhctqrEqb6DOdSfhnt2vA6KI2FPlh4mnxh4VDgoG7SY3xgE7SAVDm1caLsTFtVAtO8pJO7Vqehvj4p98UCOd7CnzsDc8ckdWal91FhXBOrOiXJ1hOJORmu85MvyadlrNeBLC78uTww7O0KNs1RTkbTspTXJMSzpbvZXovBZm8JgbEJW29_vuW8rTc30dGraeIML0dzcLpnd3emFtfxaZfvBpJWZIdKvzg7QZoy7D89b7RVmR3GJyEbTqose_j-9lLjRVv6O33MRnY9mLtWWIrzpyeijD9NRiy43ePrlVGXOFe6I_fbPL3ydWZWVzo_kCGl97VVsxuxJCMdiqWxkrR7Y235mXw3M81uVX_hIYV3HU27L4OLUvur6zSGQ3lJt8ilhuNtKBIDm9qlJPrsDAVJUTa3zWS1Th-TSBTXE7WtZHpoN_W4uEN1fG3E-c3goL7WbxKJFGVzHg2isJHKneu7hC7n9IbPky5IhZO7KB5Ps5TnoL4Y8zkH9VG2Hm0LeXY99nYSBXVbjrxaD2dxbJtoq0PjvnW0E4ttJT7G4dhIHpv0aEcUjw-K03qpsDhvHfkyldaiHXFQdw_nGZ7FaDANDskgEO8c1Lf1mc1TcmLlFlrKSuar-Fje7FjHfrBjqb1VZd1eDE9s4D5EXCvHw5Ifu8RW4rPbONZJ06bzm_lwsHxqbNVVmhOSAsEYPmarPR6PlUaOkZsQNzmxs92qK9OcxxpabgcK0W3HPMtSHGsRGcixEfrG7KAvOajr8l1OZQshXTqdCckRHSuGaS3LVYDHSmxofDaOBulBovElntm-jOIqbuVHfDQ0LDdUu_CpE1B95jR81rtv9UicJ75Ux9Yc6x6WjvWMYamanfWag_q-VJT9YK33zUkzd5m8Qnw2mA3U-JA4CtEG5kDVrEY6-4-D2uxiWV7K6jxfGDJ_tAfGRc8UZFyO8iMJjAelqn1J9rGnllMHoWI-XF2ubYnO5mBmHuMFmqmtne-t_DSVMZrLSJstT5d7kl_UeIEMlKKdQqaLwnZMSZ7LRE4ie27e5WUiXSLL3jeqfNqMVWKsI-moC_nF8GnjxsZpharZ7LRTLSRLlgljb7BjAzNlcRx7O0t7aE6F3FmsrFRtYUeSKGPZ1SvNDexm4agD7TzfeXV8LYveEe18OpplkiyjpbGPFMt0nftxpTSac9FG9-SAL6mhYTNHC8fV9QBhbZMI2Vbbao2LlBjJVyxbcpiU0Uo5zR-9TEWRNcMM4118UhRtp5DQ1qoQzW1L9fimTR7Iu_UdT2juXpGbLDXdFg80aKr0kGSIX9EcGUNfD13Td9XAO7aNH6PwKPnNHglhKhdKg3Lx1G6cBKIYDSJSxOWilykb0wsiOLbSGdJmMA7HdsJbU28KK7K7HISl7E15vWH2RHlUc81ywsIm8eNqKKsqrDaCcNbEvSXbyv2mB4bhWePTZh4Vvr5bB0kdzmbqKT07hoLKhh7uRbxBR5TIcLhW135j6odE24xmaE5NaXm_oWI-tqhjV2WWYLnUB4tm-ZgogY49O5TCWXYu5Hrr7N1rM9xgPURJMjkmyuqiPMF7eMTRHR2qJfOUVNfQRb4kqXxWYnQmRyffBNo-Vlpf4Iub249g-ibe6be63VjHWCERv6x4dTiwSdIqcnW5G8zezE14UUaGlpvbxYMPVGOzuU4K5B3jzH0kh4CD-lRerqvHfPbQB_luulI2J221ujezJY_lQK5OB0TS0QrjZCJdm5taTFdh0M8_nt0XF_58WJXOmX9crYNOBvL5ZE7ddWGYfCKq7DFeavvDfLhXp_I-PZJsO45VVoXrBZpPxnChHOOlQJyp69g24qAkqp__Tbq_T_kCN-Rff_3jz58eY1Xq1wz__Nmd8OG4IH0Kgj-PvZE07NOiPgX7eYMcHPdtHESk7tO-nz-7DC8Ft27QSH6OAwAAyjyWBs8UEdRpwUT4kwH_PgaAE1XAt8JY4Pkxz3OiDMAzlQAZaYB_Z7hLMrEXJODakCr8X0JJgelLKC88f38T6jFcgbCTREGMWZ8PURyQok8qMQvefp-UVsFvE7_b2HXqvNHP9UptKWZ9Tit2vums46DcK8RBxHfi_2gTvmiDX7SJHFR-2f3ffv4QMfhCrPRF2_CLthHo1lP8tJ63z1b_ldB3tj-d0wm7_TnYo3mPMghAQUrAQRGUaXDBHZIkvmvnRAQ4CDkot_3H-PYCYreQ7yv5nul3CxiQvMwww6_8_o90v8tycZvSZ4b8XhKocJxShqsPxX5p-L6cPdBpCqUhB8cvjP9mDjdSv6TUp-z-WQLBfR6O2xIH7Fm--BdC_kYyEf2-8rec3MKrB94WytoUfvK9K6UqLXsXKZ0P73kuvAa9SiqvVv5zh4-v_O8TdGsCejd_XpdP_ToNvJL-LrNXIkw_osWHug9ckbosP_m5_6kwu_6L6-Sa9TUTUj8Z-gxHHXSAn3nBxSct8DFrMC6eDH6Va7wifJaIOq2ehRBa4iKkQAKkAsOPjjUpaYeLCOMQsIYAAY57yPSj6Vtn_7hfwJQC4t9SUtPs3sW6vA4SDiKQYUoBjqI0SHHBOmh-lLD68hV9Ae17TUpACoBvuLoDZb17FqpeZRcKUvYGAGJ9PYmBYdCHqCK4v1eVBiDqIJxgIAnwheq-tNMztffPU8uuyy_gP4talIDBL4kswRV--_fd4AN8fyOv7N75T-TxXyKv_YWx9hcI38fjllVewFIBjp8AG_yGwo9x8DN4r31HKEl_d_0bu7eyLsrgkvkNeGH4D5XgZ0W_HNqE_zBU-FvB34cmTfivo_6c8EXkT0OE4b8Q6bfu8P8L61YVSKMexi8E5zXzGA6BjyPyCrUvAnFQ_uAZfovfnhXObujnrcELw_ct4cZB5T12DntrP_M2SEgaYODjgOSYAnrPc8yqNPggeXcYSLws6hn-AhIOQZLGSd_eCUwLyrAXdlt3TV9F5xeGODjgW11_1Zr9inhh4NFnbOll9Fv8238Kyi-nfMEPLwy_irX_gtf_Sok_qfR5E_hHKv4F0yd_v2DCXxj9x77CP-v0JYGEL6f5ZwL9z6H_lyz6H_TZduG1vxsoCAhIVeGAFd0uUVbEz3DezXauKQMlriJS5V4R4G7fMAHDtENzz6Ec50FePo8qXXuH4txLi56Sz93uVYAHWVpc_rhPwE9y_HZvAZh3wSDvGExLL-h47D0FCcPnORZQ79bR_P4ijgcKr6pIg6t3hn4PSMEqkv26QeHgWBi_hpMIBCTsrxMy4oXdIevXrdCnYb2_P-2S7xcNH4ey7uVziTpze769AbDOagpEKH_cMfWXNfyvyV-pAE0fXcryBsCNvvUGvu-DH9GpDwedbB93hg66U-5L-e7EKwz759BjXh_HpLH8HqpIt3t2EwvPBIHmXpbhCvTpEgMBqQvWiWQfa_AMCE-FHNJfWHllWZGySj32ujrq9nZuyH9fPbgh__Z-CO3PHl04pAyYoPEo8Pysd29IulVKOqd1zZL4dEKn5EfMy8mtiySDF2m8rLP344DyEfx6dmOv_UzLzx3_w-Y75EGnVJdHUZJjlvQOIMALw-7jI7pmXtFnVN6v0ItDIAy_9578dUqPcYGrzjdeAdKirFnvnydcOucM-bKmydO4f7SppzP9MxQ-NzQvowS8ozaNQIU7jHVab5x11yGqcIe4llTfO_b3oOimiCNYjr0oSgt89Z_bCd87uOvFQeX10U0XUvaaTXhB9M8j17fwhxhOxIn3Df8QRiLkxwNJHH1LfoiiMPbxwJ8Ik3A8FIMJDPyRJ0VhOBxEYhR8S39AHg4EAfL8UJTE0dtAHAU4jMbCQISh5HvcgMe5l2ZvWXbL30gVf0sprfEPQRhOROFb5vk4o_0dL4QFbkD_tkuyJPVb9aMb9N2vY8oN-CyljP4Sw1KW4R-vfIKD439M_iZPerDKK2hJKP6VATEC8jpjaZnhz3es9O-r3uct77e6yn78cb-YsqT23wKSc1DvdHt9fC8rcsYB46DeW0Q5qL9Mvv2A_y8AAP__tNObwg">