<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/89858>89858</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Converting Vector operations to different element widths should be automatically considered by the compiler wrt interleaved vectors
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    These two functions are equivalent (on little-endian):

```zig
const std = @import("std");

const VEC_SIZE = 8;

export fn foo(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(VEC_SIZE * 2, u8) {
    const pairs: @Vector(VEC_SIZE, u16) = @bitCast(std.simd.interlace([_]@Vector(VEC_SIZE, u8){ byte_idx, byte_idx }));
    return @bitCast(pairs + @as(@Vector(VEC_SIZE, u16), @splat(0x100)));
}

export fn bar(byte_idx: @Vector(VEC_SIZE, u8)) @Vector(VEC_SIZE * 2, u8) {
    return std.simd.interlace(.{ byte_idx, byte_idx + @as(@Vector(VEC_SIZE, u8), @splat(1)) });
}
```
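
To spell out why the two are equivalent, here is a minimal scalar sketch of a single lane. The helper names `fooLane` and `barLane` are hypothetical (they are not part of the reproduction above), and the agreement assumes a little-endian target. The input 255 is left out because both `b + 1` and the `u16` add overflow there.

```zig
const std = @import("std");

/// Hypothetical scalar model of one lane of `foo`: duplicate the byte into
/// both halves of a u16, then add 0x100, which bumps only the high byte.
fn fooLane(b: u8) [2]u8 {
    const pair: u16 = @as(u16, b) | (@as(u16, b) << 8);
    // On little-endian, byte 0 of the result is the low half of the u16.
    return @bitCast(pair + 0x100);
}

/// Hypothetical scalar model of one lane of `bar`: emit b, then b + 1.
fn barLane(b: u8) [2]u8 {
    return .{ b, b + 1 };
}

test "both lane models agree (assumes little-endian)" {
    var b: u8 = 0;
    // Stop before 255, where both versions would overflow.
    while (b < 255) : (b += 1) {
        try std.testing.expectEqual(barLane(b), fooLane(b));
    }
}
```

Running this with `zig test` on a little-endian machine should pass, since adding `0x100` to `b | (b << 8)` leaves the low byte untouched and increments the high byte.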

However, they compile differently:

```asm
.LCPI0_0:
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
        .short  256
foo:
        vpunpcklbw      xmm0, xmm0, xmm0
        vpaddw  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

bar:
        vpcmpeqd        xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm0, xmm1
        vpunpcklbw      xmm0, xmm0, xmm1
        ret
```

This especially becomes a problem if we increase `VEC_SIZE` to 16:

```asm
.LCPI0_0:
        .short  256
        .short 256
        .short  256
        .short  256
        .short  256
 .short  256
        .short  256
        .short  256
 .short  256
        .short  256
        .short  256
        .short 256
        .short  256
        .short  256
        .short  256
 .short  256
foo:
        vpermq  ymm0, ymm0, 216
 vpunpcklbw      ymm0, ymm0, ymm0
        vpaddw  ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret

bar:
        vpcmpeqd        xmm1, xmm1, xmm1
        vpsubb  xmm1, xmm0, xmm1
        vpunpckhbw xmm2, xmm0, xmm1
        vpunpcklbw      xmm0, xmm0, xmm1
 vinserti128     ymm0, ymm0, xmm2, 1
        ret
```

The two versions can perform differently depending on the machine. On Zen 2, `vpermq` has a latency of 6, whereas the other instructions here all have a latency of 1. The exception is the `vpaddw` in `foo`, which takes a memory operand. I presume that is slower than staying in registers, assuming the compiler is right to prefer materializing an all-ones vector with a self-comparing `vpcmpeqd` over loading the same constant from memory.

https://zig.godbolt.org/z/EKPe1bY35
</pre>