<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/79799">79799</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Spurious optimization on Zen 4 from valignq + vpalignr to vmovdqa64+vpermt2w
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    [Godbolt link](https://zig.godbolt.org/z/vx95jhzro)

```zig
const std = @import("std");
const builtin = @import("builtin");

const chunk_len = if (std.mem.eql(u8, builtin.cpu.model.name, "znver4")) 64 else 32;
const Chunk = @Vector(chunk_len, u8);

fn prev(comptime N: comptime_int, a: Chunk, b: Chunk) Chunk {
    return std.simd.mergeShift(b, a, chunk_len - N);
}

export fn prev1(a: Chunk, b: Chunk) Chunk {
    return prev(1, a, b);
}

export fn prev2(a: Chunk, b: Chunk) Chunk {
    return prev(2, a, b);
}

export fn prev3(a: Chunk, b: Chunk) Chunk {
    return prev(3, a, b);
}
```

For `znver4`, this emits:

```asm
prev1:
        valignq zmm1, zmm0, zmm1, 6
        vpalignr        zmm0, zmm0, zmm1, 15
        ret

.LCPI1_0:
        .short  63
        .short  0
        .short  1
        .short  2
        .short  3
        .short  4
        .short  5
        .short  6
        .short  7
        .short  8
        .short  9
        .short  10
        .short  11
        .short  12
        .short  13
        .short  14
        .short  15
        .short  16
        .short  17
        .short  18
        .short  19
        .short  20
        .short  21
        .short  22
        .short  23
        .short  24
        .short  25
        .short  26
        .short  27
        .short  28
        .short  29
        .short  30
prev2:
        vmovdqa64       zmm2, zmmword ptr [rip + .LCPI1_0]
        vpermt2w        zmm0, zmm2, zmm1
        ret

prev3:
        valignq zmm1, zmm0, zmm1, 6
        vpalignr        zmm0, zmm0, zmm1, 13
        ret
```
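
To make the `.LCPI1_0` table easier to read: index 63 selects the last 16-bit word of the second table register (`b`), and indices 0 through 30 select the first 31 words of `a`, which is the same result as shifting by two bytes. Below is a minimal scalar sketch that checks this, assuming Zig 0.11 or later. The helper names `prevBytes` and `prev2Words` are hypothetical and only model the instruction semantics; they are not part of `std.simd`.

```zig
const std = @import("std");

const len = 64; // bytes in one zmm register

// Byte-level reference for prev(N, a, b): the last N bytes of b followed by
// the first len - N bytes of a.
fn prevBytes(comptime n: usize, a: [len]u8, b: [len]u8) [len]u8 {
    var out: [len]u8 = undefined;
    for (0..len) |i| {
        out[i] = if (i >= n) a[i - n] else b[len - n + i];
    }
    return out;
}

// Word-level model of vpermt2w with the .LCPI1_0 table (63, then 0..30).
// Indices 0..31 pick 16-bit words of a; indices 32..63 pick words of b.
fn prev2Words(a: [len]u8, b: [len]u8) [len]u8 {
    var out: [len]u8 = undefined;
    for (0..32) |j| {
        const idx: usize = if (j == 0) 63 else j - 1;
        const src = if (idx >= 32) b else a;
        const w = idx % 32;
        out[2 * j] = src[2 * w];
        out[2 * j + 1] = src[2 * w + 1];
    }
    return out;
}

test "word permute matches the byte shift for N = 2" {
    var a: [len]u8 = undefined;
    var b: [len]u8 = undefined;
    for (0..len) |i| {
        a[i] = @intCast(i);
        b[i] = @intCast(i + 100);
    }
    const expected = prevBytes(2, a, b);
    const actual = prev2Words(a, b);
    try std.testing.expectEqualSlices(u8, expected[0..], actual[0..]);
}
```

This only works for even shifts, which would explain why `prev1` and `prev3` keep the `valignq` + `vpalignr` pair while `prev2` becomes a word permute.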

Is this a real optimization or a mistake? The compiler also does not seem to fall back to `valignq+vpalignr`, even for size-optimized builds. In ReleaseSmall, `prev2` could be emitted like so:

```asm
prev2: 
        valignq zmm1, zmm0, zmm1, 6
        vpalignr        zmm0, zmm0, zmm1, 14
        ret
```

According to uops.info, [VALIGNQ (ZMM, ZMM, ZMM, I8)](https://uops.info/html-instr/VALIGNQ_ZMM_ZMM_ZMM_I8.html) has a latency of [4](https://uops.info/html-lat/ZEN4/VALIGNQ_ZMM_ZMM_ZMM_I8-Measurements.html) and [VPALIGNR (ZMM, ZMM, ZMM, I8)](https://uops.info/html-instr/VPALIGNR_ZMM_ZMM_ZMM_I8.html) has a latency of [2](https://uops.info/html-lat/ZEN4/VPALIGNR_ZMM_ZMM_ZMM_I8-Measurements.html). Compare that to [VMOVDQA64 (ZMM, M512)](https://uops.info/html-instr/VMOVDQA64_ZMM_M512.html) with a latency of [[≤9;≤11]](https://uops.info/html-lat/ZEN4/VMOVDQA64_ZMM_M512-Measurements.html) and [VPERMT2W (ZMM, ZMM, ZMM)](https://uops.info/html-instr/VPERMT2W_ZMM_ZMM_ZMM.html) with a latency of [5;6](https://uops.info/html-lat/ZEN4/VPERMT2W_ZMM_ZMM_ZMM-Measurements.html). Is the compiler banking on the idea that maybe the CPU will hoist the load?
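
As a rough worked comparison, the figures above can be summed along each dependency chain. This is only a sketch: it assumes latencies simply add along a chain, ignores throughput and port pressure, and treats the load figures as the upper bounds that uops.info reports. The constant names are made up for illustration.

```zig
const std = @import("std");

// Latencies in cycles, copied from the uops.info Zen 4 figures quoted above.
const valignq: u32 = 4;
const vpalignr: u32 = 2;
const vpermt2w = [2]u32{ 5, 6 }; // reported as "5;6"
const load_bound = [2]u32{ 9, 11 }; // reported as upper bounds

// Assumption: each sequence is a simple dependency chain, so latencies add.
pub const align_chain = valignq + vpalignr; // 6
// If the index vector is already in a register, only the permute remains.
pub const permute_hoisted = vpermt2w; // 5 to 6
// If the load sits on the critical path, its bound is added on top.
pub const permute_with_load = [2]u32{ // at most 14 to 17
    load_bound[0] + vpermt2w[0],
    load_bound[1] + vpermt2w[1],
};

pub fn main() void {
    std.debug.print("valignq + vpalignr: {d}\n", .{align_chain});
    std.debug.print("vpermt2w, load hoisted: {d} to {d}\n", .{ permute_hoisted[0], permute_hoisted[1] });
    std.debug.print("vmovdqa64 + vpermt2w: at most {d} to {d}\n", .{ permute_with_load[0], permute_with_load[1] });
}
```

Under these assumptions the permute is on par with the two-instruction sequence only when the constant load is off the critical path, for example when it is hoisted out of a loop.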
</pre>