<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/110426>110426</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [Zen 4] Prefer to do work in k-registers instead of always moving over to general purpose registers
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    Some code ([Godbolt link](https://zig.godbolt.org/z/cxE9xsTr1)):

```zig
const std = @import("std");
const Chunk = @Vector(64, u8);

export fn foo(chunk: Chunk, a: Chunk, b: Chunk) u64 {
    const zeroes = @as(Chunk, @splat(0));
    const bits_a: u64 = @bitCast(zeroes != (a & chunk));
    const bits_b: u64 = @bitCast(zeroes != (b & chunk));
    return bits_a +% bits_b;
}
```

LLVM IR (optimized):

```llvm
define dso_local i64 @foo(<64 x i8> %0, <64 x i8> %1, <64 x i8> %2) local_unnamed_addr {
Entry:
  %3 = and <64 x i8> %1, %0
  %4 = icmp ne <64 x i8> %3, zeroinitializer
  %5 = bitcast <64 x i1> %4 to i64
  %6 = and <64 x i8> %2, %0
  %7 = icmp ne <64 x i8> %6, zeroinitializer
  %8 = bitcast <64 x i1> %7 to i64
  %9 = add i64 %8, %5
  ret i64 %9
}
```

Compiled for Zen 4:

```asm
foo:
        vptestmb        k0, zmm1, zmm0
        vptestmb        k1, zmm2, zmm0
        kmovq   rcx, k0
        kmovq   rax, k1
        add     rax, rcx
        vzeroupper
        ret
```

Suggested:

```diff
foo:
        vptestmb        k0, zmm1, zmm0
        vptestmb        k1, zmm2, zmm0
-       kmovq   rcx, k0
-       kmovq   rax, k1
-       add     rax, rcx
+       kaddq   k0, k0, k1
+       kmovq   rax, k0
        vzeroupper
        ret
```

On Zen 4, almost all k-register operations have a latency of 1 cycle ([KSHIFTRQ](https://uops.info/html-instr/KSHIFTRQ_K_K_I8.html) is an exception). LLVM should therefore be more aggressive than it currently is about keeping values in k-registers and doing the computation there.

However, the cost model should hopefully also take into account that general-purpose registers have access to a far more powerful instruction set. For example, `lea` can fuse multiple operations into a single instruction, and k-registers have no equivalent.
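
As a rough illustration of the `lea` point, here is a minimal sketch using a hypothetical function `fused` (not part of the code above). On x86-64, a base plus a scaled index plus a constant matches a single addressing mode, so a compiler can typically compute the whole expression with one `lea`:

```zig
// Hypothetical example, not from the original report.
export fn fused(a: u64, b: u64) u64 {
    // base + index * 4 + displacement: three operations that fit one
    // x86-64 addressing mode. Wrapping operators mirror `+%` in `foo`.
    return a +% (b *% 4) +% 8;
}
```

Assuming the System V calling convention (arguments in `rdi` and `rsi`), this would be expected to lower to something like `lea rax, [rdi + 4*rsi + 8]`. The same arithmetic on values held in k-registers would need separate shift and add instructions, which is the trade-off the cost model would have to weigh.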
</pre>