<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/76727>76727</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
AVX512 constants not chached in a register
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
piotr-topnotch
</td>
</tr>
</table>
<pre>
When the same AVX512 constant is used in a function, it is usually not cached in a register, but the compiler chooses to emit the same memory reference. This wastes DCache bandwidth and emits code longer than a variant with the register. Consider this function (compiles with `-march=tigerlake -O2` for x64 targets, e.g on godbolt).
```
#include <immintrin.h>
static const __m128i ars_weyl_increment = _mm_set_epi64x(0xbb67ae8584caa73b, 0x9e3779b97f4a7c15);
__m512i fn_vaes(__m512i counter, __m128i key) {
// const __m512i ars_increment = _mm512_broadcast_i32x4(*(const volatile __m128i*) &ars_weyl_increment);
const __m512i ars_increment = _mm512_broadcast_i32x4(ars_weyl_increment);
const __m512i key0 = _mm512_broadcast_i32x4(key);
const __m512i r0 = _mm512_aesenc_epi128(counter, key0);
const __m512i key1 = _mm512_add_epi64(key0, ars_increment);
const __m512i r1 = _mm512_aesenc_epi128(r0, key1);
const __m512i key2 = _mm512_add_epi64(key1, ars_increment);
const __m512i r2 = _mm512_aesenc_epi128(r1, key2);
const __m512i key3 = _mm512_add_epi64(key2, ars_increment);
const __m512i r3 = _mm512_aesenc_epi128(r2, key3);
const __m512i key4 = _mm512_add_epi64(key3, ars_increment);
const __m512i r4 = _mm512_aesenc_epi128(r3, key4);
return r4;
}
```
Then you'll see
```
vinserti128 $1, %xmm1, %ymm1, %ymm1
vinserti64x4 $1, %ymm1, %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
vpaddq **.LCPI0_0(%rip),** %zmm1, %zmm2
vaesenc %zmm2, %zmm0, %zmm0
vpaddq **.LCPI0_1(%rip),** %zmm1, %zmm2
vaesenc %zmm2, %zmm0, %zmm0
vpaddq **.LCPI0_2(%rip),** %zmm1, %zmm2
vaesenc %zmm2, %zmm0, %zmm0
vpaddq **.LCPI0_3(%rip),** %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
retq
```
The compiler could be forced to emit better code by uncommenting the first definition of `const __m512i ars_increment`:
```
vmovdqa ars_weyl_increment(%rip), %xmm2
vinserti128 $1, %xmm2, %ymm2, %ymm2
vinserti64x4 $1, %ymm2, %zmm2, %zmm2
vinserti128 $1, %xmm1, %ymm1, %ymm1
vinserti64x4 $1, %ymm1, %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
vpaddq %zmm1, %zmm2, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
vpaddq **%zmm2**, %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
vpaddq %zmm2, %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
vpaddq %zmm2, %zmm1, %zmm1
vaesenc %zmm1, %zmm0, %zmm0
retq
```
But then the compiler builds `ars_increment` in an extremely complicated way with the sequence of inserts instead of simply using a single `vbroadcasti32x4`.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzUl01v4zYTxz8NfRnEkEayZR988CZPgAco0B4WbW8CRY4tdinSISnH7qcvKMkb20kEZ9M91DAkSiT_8-PLUDPce7U1RCs2-8JmDxPehtq61U7Z4O6C3RkbRD2prDyu_qjJQKgJPG8I1r__OUsRhDU-cBNAeWg9SVAGOGxaI4KyhuE9qKGu5VofwdgAgov61NLRVvlALras2tDpC9vslCYHorbWk4dggRoVXow31Fh3BEcbcmQETeFrrTw8cx_Iw8N9NAAVN_JZyVADN7IT8CCsJNDWbMlBqHkk2HOn4gCeVag7CyekKdxb45Xsmir_fVDAcDEg-r4Xmyd3DXeiZtlDUFtymn8juPsV2TyBjXVwmOcQuNtS8HGgNN2CNbC1srI6MFxOWfLAkvVwnSfDv3_ETBmhW0nAsnvVNMoEp8y0Ztn_-hY-8KBEvxRQlk2KCwXc-fKZjrpURjhqyARg2QOUTVN6CiXt1Dw_MFwkh6qaF5wWs0UuOC-yKhImhyVlRbGslsUm54VIZwyXLPtyzlmWzSxFBRtT7jl5hovTG2FbM6zpCecbHRkugRUXEgDA8JHh4wt8JxDhX3HPUiwrZ7kU3IdSZXjIGS4Yrrv1iN33VvOgNJ2sdnVLYDh_PRvn4wGAHwcYl37PwDc6JuO6_YyNMLoLAU6ejIjrmsZJWZytQbR1I1N6ISllv096miRqXUxMLwrvAqZjgC4Z2NIb2XCELf0wG46ypQMb3siWjbDhh9myUTYc2LIb2fIRtuzDbPkoWzaw5W-zOQqtM-Dyl7ri4e1Tr7t-jZ-co20ZFlqDJxo5J2H47ZXx5EIk6p4Z5t1qMpwdmuZUPF4Ve4lT33l-yK_6nnX4-6p4ab6fk9ftkvPiZZcdl_IpWlszXE9_uf_t_0mZdGfbzKldnEq87ytfy-IJ_cIu_rDd9IN23xz2R8xfWMefb_3NUWc32v38YjsKT-M7_iz-sa2WUFGMIQTJ73FQRSF01ZKgOkJrhG2i7yqz7SKYjXI-gKSNMqqLWOwGYpAy8o2LGNn6Fvdq7F4-8Tfii6spHBzueqVGvRNfnO2ieIt34vnueHejfOJwuJL4kTPiU4fDKxf4d-XX3X9Q7h9-5kjwPyZ_5rfve--XPoMxl2lM1SotfXTAa5frciADdAjxlT52fbQSPJCEZ358SUo8PbUx04mu3G9AH--BuIyvvGp2-gitj0cAh3jTFC3uvweWfVw5T6YTucrkMlvyCa3SIskxTQpMJvWKlvlsQ8V8ITaCMM_yNM9lvlyQFDwTvJioFSaYJ2mC6WyGSTLNi4WUG5xxkRbFJilYnlDDlZ5qvW-m1m0nyvuWVsW8wGKieUXad5kmoqFn6CoZYkw83Sr2uavarWd5opUP_kUlqKBpdZVy-j6ZrN_IJiet06s6hJ2Ph1qXYWxVqNtqKmzD8DEKD7e7nbN_kQgMHzscz_Cxw_0nAAD__1cGXs4">