<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/109122>109122</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [Aarch64] `clz` on a vector of 2 x u64 should be better optimized
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    This code ([Godbolt link](https://zig.godbolt.org/z/4j538eG1P)):

```zig
export fn clz(x: @Vector(2, u64)) @Vector(2, u64) {
    return @clz(x);
}
```

gives me this output when targeting the Apple M3:

```asm
clz:
        ushr    v1.2d, v0.2d, #1
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #2
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #4
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #8
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #16
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #32
        orr     v0.16b, v0.16b, v1.16b
        mvn     v0.16b, v0.16b
        cnt     v0.16b, v0.16b
        uaddlp  v0.8h, v0.16b
        uaddlp  v0.4s, v0.8h
        uaddlp  v0.2d, v0.4s
        ret
```

It seems to me we could combine `bitReverse` + `ctz` to get better output for `clz` on vectors whose elements are u64.

It's also conceivable that we could use `clz` with u32 granularity and combine adjacent elements.

I think LLVM should generate something like this:

```zig
export fn clz2(x: @Vector(2, u64)) @Vector(2, u64) {
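    // Per-u32 clz: each u64 lane yields [clz(low half), clz(high half)] (little-endian lane order).
    const clz_with_u32_granularity: @Vector(4, u32) = @clz(@as(@Vector(4, u32), @bitCast(x)));
    // clz of the high half, moved into the low 32 bits of each u64 lane.
    const base = @as(@Vector(2, u64), @bitCast(clz_with_u32_granularity)) >> @splat(32);

    // Keep clz of the low half only where the high half was all zeros (its clz == 32).
    const mask = @select(
        u32,
        @as(@Vector(4, u32), @bitCast(base)) == @as(@Vector(4, u32), @splat(32)),
        clz_with_u32_granularity,
        @as(@Vector(4, u32), @splat(0)),
    );

    return base + @as(@Vector(2, u64), @bitCast(mask));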
}
```

That gives us this assembly:

```asm
clz2:
        clz     v1.4s, v0.4s
        ushr    v0.2d, v1.2d, #32
        movi    v2.4s, #32
        cmeq    v0.4s, v0.4s, v2.4s
        and     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v1.2d, #32
        ret
```

Alternatively, the `usra` could probably have been an `add`.

Assuming I didn't mess anything up, Z3 seems to prove this is a correct transformation? https://alive2.llvm.org/ce/z/878QXU
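
For intuition, here is a minimal scalar sketch of what each u64 lane of `clz2` computes, together with the `bitReverse` + `ctz` identity mentioned above. `clz64ViaHalves` is a hypothetical helper name used only for illustration; it is not part of the proposed lowering, and a handful of spot checks is no substitute for the Alive2 proof.

```zig
const std = @import("std");

// Hypothetical scalar model of one u64 lane of `clz2` (name is illustrative only).
fn clz64ViaHalves(x: u64) u64 {
    const hi: u32 = @truncate(x >> 32);
    const lo: u32 = @truncate(x);
    const hi_clz: u64 = @clz(hi); // in 0..32
    const lo_clz: u64 = @clz(lo); // in 0..32
    // The low half only contributes when the high half is entirely zero.
    return if (hi_clz == 32) hi_clz + lo_clz else hi_clz;
}

test "u32-granularity clz and bitReverse+ctz agree with @clz" {
    // All-zero input: both halves report 32 leading zeros.
    try std.testing.expectEqual(@as(u64, 64), clz64ViaHalves(0));
    // Every single-bit input, covering both halves of the lane.
    for (0..64) |i| {
        const x = @as(u64, 1) << @as(u6, @intCast(i));
        try std.testing.expectEqual(@as(u64, @clz(x)), clz64ViaHalves(x));
        try std.testing.expectEqual(@clz(x), @ctz(@bitReverse(x)));
    }
}
```

Saved under a placeholder name such as `clz_check.zig`, this should run with `zig test clz_check.zig`.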
</pre>