<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/79799>79799</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Spurious optimization on Zen 4 from valignq + vpalignr to vmovdqa64+vpermt2w
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
[Godbolt link](https://zig.godbolt.org/z/vx95jhzro)
```zig
const std = @import("std");
const builtin = @import("builtin");
const chunk_len = if (std.mem.eql(u8, builtin.cpu.model.name, "znver4")) 64 else 32;
const Chunk = @Vector(chunk_len, u8);
fn prev(comptime N: comptime_int, a: Chunk, b: Chunk) Chunk {
return std.simd.mergeShift(b, a, chunk_len - N);
}
export fn prev1(a: Chunk, b: Chunk) Chunk {
return prev(1, a, b);
}
export fn prev2(a: Chunk, b: Chunk) Chunk {
return prev(2, a, b);
}
export fn prev3(a: Chunk, b: Chunk) Chunk {
return prev(3, a, b);
}
```
emits:
```asm
prev1:
valignq zmm1, zmm0, zmm1, 6
vpalignr zmm0, zmm0, zmm1, 15
ret
.LCPI1_0:
.short 63
.short 0
.short 1
.short 2
.short 3
.short 4
.short 5
.short 6
.short 7
.short 8
.short 9
.short 10
.short 11
.short 12
.short 13
.short 14
.short 15
.short 16
.short 17
.short 18
.short 19
.short 20
.short 21
.short 22
.short 23
.short 24
.short 25
.short 26
.short 27
.short 28
.short 29
.short 30
prev2:
vmovdqa64 zmm2, zmmword ptr [rip + .LCPI1_0]
vpermt2w zmm0, zmm2, zmm1
ret
prev3:
valignq zmm1, zmm0, zmm1, 6
vpalignr zmm0, zmm0, zmm1, 13
ret
```
Is this a real optimization or a mistake? The compiler also does not seem to fall back to `valignq` + `vpalignr` for `prev2`, even in size-optimized builds, where the two-instruction sequence would avoid the 64-byte index table. In ReleaseSmall, `prev2` could be emitted like so:
```asm
prev2:
valignq zmm1, zmm0, zmm1, 6
vpalignr zmm0, zmm0, zmm1, 14
ret
```
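As a sanity check that the two-instruction form really covers `prev2` (and every other shift from 1 to 15 bytes), here is a minimal scalar sketch. The helpers `valignq`, `vpalignr`, and `prevRef` are hypothetical byte-level models written for this issue, not std or LLVM APIs, and they assume the Intel operand order used in the listings above (first source is the high half of the concatenation). Run with `zig test`; the range-based `for` loops assume Zig 0.11 or later.

```zig
const std = @import("std");

const len = 64;
const Bytes = [len]u8;

// Hypothetical scalar model of `valignq dst, hi, lo, imm` on zmm registers:
// concatenate hi:lo (lo in the low 64 bytes), shift right by `imm` qwords,
// keep the low 64 bytes. Valid for imm in 0..7.
fn valignq(hi: Bytes, lo: Bytes, imm: usize) Bytes {
    var out: Bytes = undefined;
    for (0..len) |i| {
        const src = i + imm * 8;
        out[i] = if (src < len) lo[src] else hi[src - len];
    }
    return out;
}

// Hypothetical scalar model of `vpalignr dst, hi, lo, imm` on zmm registers:
// in each 128-bit lane independently, concatenate hi_lane:lo_lane,
// shift right by `imm` bytes, keep the low 16 bytes. Valid for imm in 0..16.
fn vpalignr(hi: Bytes, lo: Bytes, imm: usize) Bytes {
    var out: Bytes = undefined;
    for (0..len) |i| {
        const lane = (i / 16) * 16;
        const src = i % 16 + imm;
        out[i] = if (src < 16) lo[lane + src] else hi[lane + src - 16];
    }
    return out;
}

// Expected result of prev(n, a, b): the last n bytes of b,
// followed by the first len - n bytes of a.
fn prevRef(n: usize, a: Bytes, b: Bytes) Bytes {
    var out: Bytes = undefined;
    for (0..len) |i| {
        out[i] = if (i < n) b[len - n + i] else a[i - n];
    }
    return out;
}

test "valignq 6 + vpalignr (16 - n) matches prev(n) for n in 1..15" {
    var a: Bytes = undefined;
    var b: Bytes = undefined;
    for (0..len) |i| {
        a[i] = @intCast(i); // 0..63
        b[i] = @intCast(i + 100); // 100..163, distinct from a
    }
    for (1..16) |n| {
        const tmp = valignq(a, b, 6); // zmm1 = valignq(zmm0, zmm1, 6)
        const got = vpalignr(a, tmp, 16 - n); // zmm0 = vpalignr(zmm0, zmm1, 16 - n)
        const want = prevRef(n, a, b);
        try std.testing.expectEqualSlices(u8, &want, &got);
    }
}
```

With `n = 2` the model uses an immediate of 14, which is the value in the suggested `prev2` above.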
According to uops.info, on Zen 4 [VALIGNQ (ZMM, ZMM, ZMM, I8)](https://uops.info/html-instr/VALIGNQ_ZMM_ZMM_ZMM_I8.html) has a latency of [4](https://uops.info/html-lat/ZEN4/VALIGNQ_ZMM_ZMM_ZMM_I8-Measurements.html) and [VPALIGNR (ZMM, ZMM, ZMM, I8)](https://uops.info/html-instr/VPALIGNR_ZMM_ZMM_ZMM_I8.html) has a latency of [2](https://uops.info/html-lat/ZEN4/VPALIGNR_ZMM_ZMM_ZMM_I8-Measurements.html). Compare that to [VMOVDQA64 (ZMM, M512)](https://uops.info/html-instr/VMOVDQA64_ZMM_M512.html) with a latency of [[≤9;≤11]](https://uops.info/html-lat/ZEN4/VMOVDQA64_ZMM_M512-Measurements.html) and [VPERMT2W (ZMM, ZMM, ZMM)](https://uops.info/html-instr/VPERMT2W_ZMM_ZMM_ZMM.html) with a latency of [5;6](https://uops.info/html-lat/ZEN4/VPERMT2W_ZMM_ZMM_ZMM-Measurements.html). Is the compiler banking on the index load being hoisted, or otherwise issued early enough that it stays off the critical path?
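
For a rough comparison, the latencies quoted above can be summed along each dependency chain. This is only a sketch under a simple additive model: it assumes the two instructions in each sequence execute back to back, and it ignores throughput, port pressure, and cache misses.

```math
\begin{aligned}
\texttt{valignq} + \texttt{vpalignr} &: 4 + 2 = 6 \text{ cycles} \\
\texttt{vmovdqa64} + \texttt{vpermt2w} \text{ (load on the critical path)} &: (\le 9 \text{ to } \le 11) + (5 \text{ to } 6) \Rightarrow \text{up to } 14 \text{ to } 17 \text{ cycles} \\
\texttt{vpermt2w} \text{ alone (index already in a register)} &: 5 \text{ to } 6 \text{ cycles}
\end{aligned}
```

Under these assumptions the permute only breaks even or wins when the index load is off the critical path. That can happen, since the load does not depend on `a` or `b`, but it still costs the 64-byte table.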
</pre>