<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=http://email.email.llvm.org/c/eJylVcty2zgQ_BrqgoqKpkRbOuggy_FWqjZ72E0lZxAYkkhAgAFAy8rXbw8o-ZX1aatkgyCJnp6enmHj9Wn3t-n6JJw_CuOEtQ-DOJLoyFGQiURRVe1krbDej6Lxk9O4I1RP6kcUrQ8i0M_JBNJioOF8GzipJ_FAKvlgflEoVvuivCvKfZ_SGHlX3ePXed14m5Y-dNj9wt_2s7z5-bh50PPr8_-PcSRlpLWnHJChHcWEkJmUkpGK6pDvh8klM9CFX0PKYyfBTTrjOmQiRgqKXJIdCd_mQ8knaS9Hl-JLT6cZgFFzxKNHKKmjkE4DLSIt4ixxaYbRUibylGRxXeJQUW34bSxGDaOYbOJrzZjxqqhuXVFtefPyKT84CP30NL8g3oWp3oOpfoPZgtRLTb9BI4kEUqJhTJBA-BHpo1iQxMfflDyXtJHqB2UNGpM4SDMlcTTwh6MHCrDNMxY0nZylGNlPqpeuo4xxlCdeT0KGZ5_ppbiHaLNl3gTnOIDQHiaFUQmFR4z8DEAyZVSUFC4UIYc5F1aGgFhSKZDAoSbHnU2jPcUM50HbynEpoIh3sNhbfJpdP5udpOpnAJOYt_FuKf5iGD6aQVNv4BNr0U8D22QMMG98hXLs6WWHCOX1XI4GeTySmqBHrmkSGSv6s8_CUwNcCgGiY_B6UvSiJ5cvK32fTwDnbNVLuxzfyRd6witafBBRFKvDJL7ew1N78emQly-nkf6BTfAOWw3gES2GAK4TyrJzgMUoHGMwboqoRsvae4fsQvABBd1ymyHSOelIFkqAB2y23wfVX6-FRObMJE6NeDz7GZ4vVreCrZ5vFdXqwDcaaKjwbq6i8pPVs2pzGVgvnLvDTz-lz8bNyQ55_J2nGHoNFUhBCqlzPLD443CIuUTwKlaMDuPiK4kxMXisQPlBOlRCm7alQLiM4hPPnjjxgDwSB4iiR0Rer-qiqtE9qc8UlQ-wClhBYUh0r8aJl9du4-xcikux5-JzWVnCPOWmkeVDGoknnmVTtmK0UtHcQT3Cc49exvKMd0kc09uH89yLy4XerfR2tZULOWEehJ2WDzR0gcgtpmB3b-Y4UpiaJVgwX3xBzssHWPM7omFrYpxyVvWq3paLfldfk270pi51W69XbV3WqpRlq6hubpTc3iysbMjGXVHf4oPzijP2RX23-P8szK4qK_zK9dWmXlfV8mZ9I6tat-VGrja0vSrWJQ3S2CXj8GdqEXYZspm6iIfWRNTi6aGM0XTop0waDJNJNm_-RA2_Pn8N6zvxmUcDta1RBpPrv-feItPdZa7_AjLtknE>53590</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[LoopVectorizer] More efficient vector runtime checks
</td>
</tr>
<tr>
<th>Labels</th>
<td>
vectorization
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
davemgreen
</td>
</tr>
</table>
<pre>
Right now in LLVM we generate "full loop bound" checks for required memchecks in the vectorizer:
https://godbolt.org/z/9Ma7qx8vd
Especially for the nested loop case, the runtime checks become a meaningful percentage of the total runtime. They check, for two loads and a store in a simple loop:
`or (and (icmp ult (d, s1+n), icmp ult (s1, d+n)), and (icmp ult (d, s2+n), icmp ult (s2, d+n)))`
We can attempt to optimize those runtime checks in the backend a bit, but we will never get to optimal code unless we change the way they are generated. For vector runtime checks, we do not need to check that the entire ranges of the arrays accessed by the loop do not overlap; we only need to check the bounds for each loop iteration. Not only does this allow more precise bounds for when the vector code can be executed, it is also simpler for the backend to produce checks for.
For this simple case, we only need to check that `d - s &lt;u VF * IC * TypeSize` (or something close to that, minus off-by-one errors). That can be selected in AArch64 as a `sub x, d, s; cmp x, #C; br cc`. We could also allow the s==d case, but that might require an extra add, as GCC's codegen has.
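As a rough illustration of the difference between the two kinds of check, here is a minimal C sketch. It assumes a forward, unit-stride loop that loads from `s` and stores to `d`, and models addresses as plain unsigned integers. The names `addr_t`, `ranges_conflict` and `diff_conflict` are made up for this example and are not part of LLVM.

```c
/* Hypothetical model: addresses are plain unsigned integers. */
typedef unsigned long long addr_t;

/* Full-range check, in the style of the current memchecks: returns 1 when
   the byte ranges [d, d+n) and [s, s+n) overlap anywhere across the loop. */
int ranges_conflict(addr_t d, addr_t s, addr_t n) {
    return d < s + n && s < d + n;
}

/* Per-iteration check: returns 1 when the store address d is at or above
   the load address s by fewer than vf * ic * type_size bytes. If d is below
   s, the unsigned subtraction wraps to a huge value, so the check passes
   and the vector loop can run. */
int diff_conflict(addr_t d, addr_t s, addr_t vf, addr_t ic, addr_t type_size) {
    return d - s < vf * ic * type_size;
}
```

With `d` 16 bytes below `s` and a 4096-byte trip range, `ranges_conflict` reports a conflict and forces the scalar loop, while `diff_conflict` with VF = 4, IC = 1 and 4-byte elements does not. Note that this sketch still treats `d == s` as a conflict, which is the case mentioned above that would need extra handling.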
The performance differences I measured were as high as 15% with the correct types/CPUs/loop iteration counts. This can come up in quite a lot of places, wherever vectorization requires memory checks.
</pre>