<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/89858">89858</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Converting Vector operations to different element widths should be automatically considered by the compiler wrt interleaved vectors
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
These two functions are equivalent (on little-endian targets):
```zig
const std = @import("std");

const VEC_SIZE = 8;

export fn foo(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(VEC_SIZE * 2, u8) {
    const pairs: @Vector(VEC_SIZE, u16) = @bitCast(std.simd.interlace([_]@Vector(VEC_SIZE, u8){ byte_idx, byte_idx }));
    return @bitCast(pairs + @as(@Vector(VEC_SIZE, u16), @splat(0x100)));
}

export fn bar(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(VEC_SIZE * 2, u8) {
    return std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) });
}
```
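Why they are equivalent: on little-endian, each `u16` lane of `pairs` holds `byte_idx[i]` in both its low and high byte, so adding `0x100` bumps only the high byte, i.e. the second element of each interleaved pair, which is exactly the `byte_idx + 1` that `bar` interlaces. A quick sanity check along these lines (just a sketch, not part of the original Godbolt link; the input is chosen to avoid `u8` overflow in `bar`):
```zig
test "foo and bar agree" {
    // Arbitrary input kept below 255 so `byte_idx + 1` in `bar` cannot overflow.
    const input: @Vector(VEC_SIZE, u8) = .{ 0, 1, 2, 3, 100, 200, 7, 254 };
    // Both should yield b[0], b[0]+1, b[1], b[1]+1, ...
    try std.testing.expect(@reduce(.And, foo(input) == bar(input)));
}
```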
However, they compile differently:
```asm
.LCPI0_0:
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
foo:
        vpunpcklbw xmm0, xmm0, xmm0
        vpaddw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret
bar:
        vpcmpeqd xmm1, xmm1, xmm1
        vpsubb xmm1, xmm0, xmm1
        vpunpcklbw xmm0, xmm0, xmm1
        ret
```
This becomes even more of a problem if we increase `VEC_SIZE` to 16:
```asm
.LCPI0_0:
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
        .short 256
foo:
        vpermq ymm0, ymm0, 216
        vpunpcklbw ymm0, ymm0, ymm0
        vpaddw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret
bar:
        vpcmpeqd xmm1, xmm1, xmm1
        vpsubb xmm1, xmm0, xmm1
        vpunpckhbw xmm2, xmm0, xmm1
        vpunpcklbw xmm0, xmm0, xmm1
        vinserti128 ymm0, ymm0, xmm2, 1
        ret
```
These can have different performance characteristics depending on the machine. On Zen 2, `vpermq` has a latency of 6, whereas the rest of these instructions all have a latency of 1. There is also the `vpaddw` with a memory operand, which I presume will be slower than staying in registers, assuming the compiler is right to prefer materializing all-ones with an identity `vpcmpeqd` rather than loading that constant through a memory operand for the same purpose.
https://zig.godbolt.org/z/EKPe1bY35
</pre>