<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/63460>63460</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[wasm] should vector locals be promoted (?) to live on the stack if indexed
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
kg
</td>
</tr>
</table>
<pre>
Arbitrary (runtime) indexing of vector types appears to generate suboptimal code in trunk clang, even with -O3. Take this toy example (pulled from our codebase and stripped down):
```c
#include <stdint.h>
#include <stdbool.h>
#include <wasm_simd128.h>
typedef void * gpointer;
typedef int8_t v128_i1 __attribute__ ((vector_size (16)));
void
interp_packedsimd_shuffle (gpointer res, gpointer _lower, gpointer _upper, gpointer _indices) {
    v128_i1 indices = *((v128_i1 *)_indices),
            lower = *((v128_i1 *)_lower),
            upper = *((v128_i1 *)_upper),
            result = { 0 };
    for (int i = 0; i < 16; i++) {
        int index = indices[i] & 31;
        if (index > 15)
            result[i] = upper[index - 16];
        else
            result[i] = lower[index];
    }
    *((v128_i1 *)res) = result;
}
```
All of the vector indexing operations appear to store the whole v128 local to a temporary memory slot, then load or store the specific lane being accessed in that slot, and potentially reload the whole v128 from memory to update the local. According to godbolt (https://gcc.godbolt.org/z/PM9oao635), the `int index = indices[i] & 31` part, for example, looks like this:
```
local.get 4
local.get 7
v128.store 48
block
block
local.get 4
i32.const 48
i32.add
local.get 2
i32.const 15
i32.and
local.tee 1
i32.or
i32.load8_u 0
i32.const 31
i32.and
```
It might be valuable to detect that a vector local has operations like this performed on it multiple times (or even once) and, if so, promote (demote?) it to live on the stack (in memory), so that indexed reads/writes stay efficient. That might de-opt the operations that could otherwise be performed natively on vectors held in locals, but not significantly: afaik the only difference between `local.get` and `v128.load` on x64 is the type of memory load used (2 offsets vs 3 offsets); both are still memory loads.
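To illustrate what "living in memory" would mean at the source level, here is a minimal sketch of the same shuffle written against plain byte arrays instead of vector locals. The function name `interp_packedsimd_shuffle_mem` and the combined 32-byte `table` are made up for this example and are not from our codebase. I have not measured whether this generates better wasm; it only shows the shape of code where every lane access is a single byte load or store.
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-resident variant: the vectors are copied into
 * ordinary arrays once, so each indexed access is a plain byte access. */
void
interp_packedsimd_shuffle_mem (void *res, const void *lower, const void *upper, const void *indices) {
    int8_t table[32];   /* lanes 0..15 = lower, lanes 16..31 = upper */
    uint8_t idx[16];
    int8_t result[16];

    memcpy (table, lower, 16);
    memcpy (table + 16, upper, 16);
    memcpy (idx, indices, 16);

    /* index & 31 selects lower for 0..15 and upper for 16..31,
     * matching the if/else in the original loop. */
    for (int i = 0; i < 16; i++)
        result[i] = table[idx[i] & 31];

    memcpy (res, result, 16);
}
```
Concatenating `lower` and `upper` into one table also removes the branch on `index > 15`, since `table[index]` for 16..31 is the same lane as `upper[index - 16]`.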
Sorry if I filed this in the wrong place! It seems to me like this would be handled at the llvm level.
</pre>