<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/140558>140558</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Vectorizer chooses normal vectors over vscale on grace
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
gbaraldi
</td>
</tr>
</table>
<pre>
While reading Arm's blog post https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/whats-new-in-llvm-20
with @giordano, we noticed that for some pieces of code LLVM chooses normal (fixed-width) vectors over vscale (scalable) vectors, which we found odd. Specifically, it vectorizes the main loop with fixed-width vectors and the epilogue with vscale vectors.
https://godbolt.org/z/GaT958xhE
```c
#include <stdint.h>
int32_t sdot(int8_t *a, int8_t *b, unsigned N) {
int32_t total = 0;
for (unsigned i = 0; i < N; i++) {
total += a[i] * b[i];
}
return total;
}
```
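To make the main loop / epilogue split concrete, here is a rough scalar model in C. It is only a sketch of the loop structure, not what LLVM actually emits: `sdot_split`, `main_vf` and `epi_vf` are hypothetical names, and the two factors are plain parameters (elements processed per iteration) rather than values read from the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Reference scalar loop, same computation as sdot above. */
static int32_t sdot_ref(const int8_t *a, const int8_t *b, unsigned n) {
    int32_t total = 0;
    for (unsigned i = 0; i < n; i++)
        total += a[i] * b[i];
    return total;
}

/* Scalar model of a vectorized main loop followed by a vectorized epilogue.
 * main_vf and epi_vf are hypothetical parameters: the number of elements
 * each "vector" iteration handles in the main loop and in the epilogue. */
static int32_t sdot_split(const int8_t *a, const int8_t *b, unsigned n,
                          unsigned main_vf, unsigned epi_vf) {
    assert(main_vf > 0 && epi_vf > 0);
    int32_t total = 0;
    unsigned i = 0;

    /* Main loop: whole blocks of main_vf elements. */
    for (; n - i >= main_vf; i += main_vf)
        for (unsigned lane = 0; lane < main_vf; lane++)
            total += a[i + lane] * b[i + lane];

    /* Epilogue loop: whole blocks of epi_vf elements from what is left. */
    for (; n - i >= epi_vf; i += epi_vf)
        for (unsigned lane = 0; lane < epi_vf; lane++)
            total += a[i + lane] * b[i + lane];

    /* Scalar remainder: fewer than epi_vf elements. */
    for (; i < n; i++)
        total += a[i] * b[i];

    return total;
}
```

With `main_vf = 16` and `epi_vf = 4`, an input of 37 elements runs two main-loop blocks, one epilogue block and one scalar iteration, and returns the same value as `sdot_ref`.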
Digging deeper, I looked at just the loop vectorizer (https://godbolt.org/z/orTGbvo8f). It looks like it does not consider factors as large for scalable vectors as it does for fixed-width ones. This shows up in the debug output as:
> LV(REG): VF = 8
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV(REG): RegisterClass: Generic::VectorRC, 6 registers
LV(REG): Found invariant usage: 1 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV(REG): VF = 16
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV(REG): RegisterClass: Generic::VectorRC, 12 registers
LV(REG): Found invariant usage: 1 item
LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
LV: The Widest register safe to use is: vscale x 128 bits.
LV: Found feasible scalable VF = vscale x 4
So the widest scalable factor it considers is vscale x 4, while the fixed-width path uses 16 lanes. This is a bit puzzling because each vscale unit (on AArch64 at least) is as wide as a normal NEON register. I wonder if the vectorizer doesn't know that SVE has loads to multiple registers at once, or something of that sort. It seems to think that vscale x 4 is the largest feasible factor.
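
To put numbers on the comparison, here is a small C sketch of the lane arithmetic. It relies on vscale being the SVE vector length divided by 128 bits on AArch64. The helper names are hypothetical, and the 128-bit vector length used in the note below is my assumption about Grace, not something taken from the debug output.

```c
#include <assert.h>

/* vscale on AArch64 SVE: vector length in bits divided by 128. */
static unsigned vscale_for(unsigned sve_bits) {
    assert(sve_bits >= 128 && sve_bits % 128 == 0);
    return sve_bits / 128;
}

/* Elements processed per iteration for a scalable VF "vscale x min_lanes". */
static unsigned scalable_lanes(unsigned min_lanes, unsigned sve_bits) {
    return min_lanes * vscale_for(sve_bits);
}

/* Smallest min_lanes such that "vscale x min_lanes" covers at least
 * fixed_vf elements per iteration (rounded up). */
static unsigned min_lanes_to_match(unsigned fixed_vf, unsigned sve_bits) {
    unsigned vs = vscale_for(sve_bits);
    return (fixed_vf + vs - 1) / vs;
}
```

Assuming a 128-bit SVE implementation, `scalable_lanes(4, 128)` is 4 elements per iteration against 16 for the fixed VF, and `min_lanes_to_match(16, 128)` is 16, so the scalable factor would have to be vscale x 16 to keep up with the fixed-width main loop.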
</pre>