<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/53507">53507</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[vectorization] Inefficient use of SIMD instructions on x86 processors
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
zephyr111
</td>
</tr>
</table>
<pre>
The following simple code produces pretty inefficient assembly with the flags `-O3 -mavx2 -mfma -ffast-math`, regardless of the Clang version used. This can be seen on [GodBolt](https://godbolt.org/z/M1abf4fe8).
```cpp
void computeBlockSlow(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    double s[4] = {0.0};
    long sihw_vect_max = (si*sh*sw)/4*4;
    for(long ihw = 0 ; ihw < sihw_vect_max ; ihw += 4)
        for(long i = 0 ; i < 4 ; ++i)
            s[i] += ba[ihw+i]*bb[ihw+i];
    bs[0] = (s[0] + s[2]) + (s[1] + s[3]);
}
```
The automatic vectorization produces FMA operations working on XMM registers instead of YMM, and it also uses the `vunpckhpd` instruction for no apparent reason. This is the case for all versions from Clang 5.0 to Clang 13.0. Note that the use of the `__restrict` keyword does not visibly change the outcome.
The recent trunk version of Clang on GodBolt (commit 2f18b02d) succeeds in using YMM registers, but it makes use of many expensive `vperm2f128` and `vunpcklpd` instructions.
It is possible to perform a much better vectorization using SIMD intrinsics. Here is an example (note that the loop should be unrolled about 4 times so as to mitigate the latency of the FMA instructions):
```cpp
#include <x86intrin.h>

void computeBlockFast(double* bs, double* ba, double* bb, long si, long sh, long sw, long lda, long ldb)
{
    __m256d s = _mm256_set1_pd(0);
    long sihw_vect_max = (si*sh*sw)/4*4;
    for(long ihw = 0 ; ihw < sihw_vect_max ; ihw += 4)
        s = _mm256_fmadd_pd(_mm256_loadu_pd(ba+ihw), _mm256_loadu_pd(bb+ihw), s);
    __m128d tmp = _mm_add_pd(_mm256_extractf128_pd(s, 0), _mm256_extractf128_pd(s, 1));
    bs[0] = _mm_cvtsd_f64(_mm_hadd_pd(tmp, tmp));
}
```
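The ~4x unrolling mentioned above can also be sketched in plain C++, without intrinsics, by keeping four independent accumulator groups so that the FMA dependency chains do not serialize. This is only an illustrative scalar sketch of the idea (the function name and the 16-element step are not part of the report):
```cpp
// Hypothetical scalar sketch: four independent 4-wide accumulator groups
// break the loop-carried dependency on a single accumulation chain, which
// is the same effect as unrolling the intrinsics loop with 4 __m256d
// accumulators to hide FMA latency.
void computeBlockUnrolled(double* bs, const double* ba, const double* bb, long n)
{
    double s[4][4] = {};            // s[g][i]: accumulator group g, lane i
    long n16 = n / 16 * 16;         // 4 groups * 4 lanes consumed per iteration
    for (long ihw = 0; ihw < n16; ihw += 16)
        for (long g = 0; g < 4; ++g)
            for (long i = 0; i < 4; ++i)
                s[g][i] += ba[ihw + 4*g + i] * bb[ihw + 4*g + i];
    double acc = 0.0;               // final horizontal reduction
    for (long g = 0; g < 4; ++g)
        for (long i = 0; i < 4; ++i)
            acc += s[g][i];
    bs[0] = acc;
}
```
The compiler is then free to map each accumulator group to one YMM register, since the groups never depend on each other within an iteration.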
When a register blocking strategy is manually applied, the generated code is even worse (it makes use of slow gather instructions instead of packed loads). For more information about this more complex example, please read [this](https://stackoverflow.com/questions/70907083/numpy-einsum-tensordot-with-shared-non-contracted-axis/70911182#70911182) Stack Overflow post.
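For context, "register blocking" here means computing several outputs per pass over the shared data. The sketch below is a minimal hypothetical illustration of that pattern (two rows of `ba`, separated by the leading dimension `lda`, sharing one pass over `bb`); it is not the reporter's kernel, and the full gather-producing example is in the linked Stack Overflow post:
```cpp
// Hypothetical sketch of register blocking: two dot products are accumulated
// per pass, so each element of bb is loaded once but used twice. The second
// accumulator reads ba at a stride of `lda` relative to the first.
void computeBlockBlocked(double* bs, const double* ba, const double* bb,
                         long n, long lda)
{
    double s0[4] = {}, s1[4] = {};
    long n4 = n / 4 * 4;
    for (long ihw = 0; ihw < n4; ihw += 4)
        for (long i = 0; i < 4; ++i) {
            s0[i] += ba[ihw + i]       * bb[ihw + i];  // row 0 of ba
            s1[i] += ba[lda + ihw + i] * bb[ihw + i];  // row 1 of ba
        }
    bs[0] = (s0[0] + s0[2]) + (s0[1] + s0[3]);
    bs[1] = (s1[0] + s1[2]) + (s1[1] + s1[3]);
}
```
Each row's accesses are still contiguous, so ideally the compiler should emit packed loads for both rows rather than gathers.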
</pre>