<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/103498>103498</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AVX-512] LLVM is inconsistent on whether to move a routine to k-registers or not
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    I had this function:

```zig
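const std = @import("std");

// `vpcompress` is a helper (not shown here) that wraps the AVX-512 `vpcompressb` instruction.
export fn run_lengths1(bitstr: u64) @Vector(32, u8) {
    const ends = bitstr & ~(bitstr >> 1);
    const starts = bitstr & ~(bitstr << 1);

    const iota = std.simd.iota(u8, 64);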

    // We narrow to just the first 32 values because 32 is the maximum popCount of ends/starts
    const end_positions = std.simd.extract(vpcompress(iota, undefined, ends), 0, 32);
    const start_positions = std.simd.extract(vpcompress(iota, undefined, starts), 0, 32);

    return end_positions - start_positions + @as(@Vector(32, u8), @splat(1));
}
```

It compiles to this for Zen 4:

```asm
.LCPI0_0:
        .byte   0
        ; ... <iota vector of [0,63] here>
run_lengths1:
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI0_0]
        kmovq   k0, rdi
        kshiftrq k1, k0, 1
        kaddq   k2, k0, k0
        kandnq  k1, k1, k0
        kandnq  k2, k2, k0
        vpcompressb     zmm1 {k1} {z}, zmm0
        vpcompressb     zmm0 {k2} {z}, zmm0
        vpsubb  ymm0, ymm1, ymm0
        vpcmpeqd        ymm1, ymm1, ymm1
        vpsubb  ymm0, ymm0, ymm1
        ret
```

So far so good. LLVM moves `bitstr` into a `k` register before executing the first two lines of my `run_lengths1` function. However, when `bitstr` is computed inside the function instead of being passed in directly:


```zig
export fn run_lengths2(a: u64, b: u64) @Vector(32, u8) {
    return run_lengths1(a | b);
}
```

LLVM decides to do the computation in general-purpose registers, then move the two results into `k` registers separately:

```asm
.LCPI1_0:
        .byte   0
        ; ... <iota vector of [0,63] here>
run_lengths2:
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI1_0]
        or      rdi, rsi
        mov     rax, rdi
        shr     rax
        lea     rcx, [rdi + rdi]
        andn    rax, rax, rdi
        andn    rcx, rcx, rdi
        kmovq   k1, rax
        vpcompressb     zmm1 {k1} {z}, zmm0
        kmovq   k1, rcx
        vpcompressb     zmm0 {k1} {z}, zmm0
        vpsubb  ymm0, ymm1, ymm0
        vpcmpeqd        ymm1, ymm1, ymm1
        vpsubb  ymm0, ymm0, ymm1
        ret
```

[Godbolt link](https://zig.godbolt.org/z/vaM76r3G7)

I think that on Zen 4 it's probably better to do this in `k` registers, while on Zen 5 it's better to stay in general-purpose registers, since all k-register operations have a latency of 2 cycles there. But that's beside the point. This issue is about consistency: why does LLVM make a different choice for `run_lengths2` than for `run_lengths1`, when the two differ only by a single `or`?
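
For anyone reading the assembly, here is a minimal scalar sketch of what both functions are meant to compute: the length of each run of set bits, lowest run first. `runLengthsScalar` is a hypothetical helper written only for illustration and is not part of the original code. It leaves lanes past the last run at 0, which is an assumption; the vector version passes `undefined` to `vpcompress`, so those lanes are unspecified there.

```zig
const std = @import("std");

/// Scalar model of run_lengths1 (hypothetical, for illustration only).
/// `ends` has a bit set at the last position of every run of 1s,
/// `starts` has a bit set at the first position of every run.
fn runLengthsScalar(bitstr: u64) [32]u8 {
    var ends = bitstr & ~(bitstr >> 1);
    var starts = bitstr & ~(bitstr << 1);

    // Assumption: unused lanes are zero in this model.
    var out = [_]u8{0} ** 32;
    var i: usize = 0;

    // `ends` and `starts` always have the same popcount (at most 32),
    // so both masks run out of bits at the same time.
    while (ends != 0) : (i += 1) {
        const end_pos: u8 = @ctz(ends);
        const start_pos: u8 = @ctz(starts);
        out[i] = end_pos - start_pos + 1;

        // Clear the lowest set bit of each mask.
        ends &= ends - 1;
        starts &= starts - 1;
    }
    return out;
}

test "two runs: bits 1..2 and bits 4..6" {
    const r = runLengthsScalar(0b0111_0110);
    try std.testing.expectEqual(@as(u8, 2), r[0]);
    try std.testing.expectEqual(@as(u8, 3), r[1]);
}
```

With the input `0b0111_0110`, `ends` is `0b0100_0100` and `starts` is `0b0001_0010`, so the subtraction pairs bit 2 with bit 1 and bit 6 with bit 4. Both mask computations are plain shift/and-not operations, which is why they can be done equally well in general-purpose or `k` registers.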
</pre>