<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/110308>110308</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AVX-512] clz(32 x u8) and clz(64 x u8) should use an algorithm similar to avx2
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
This code:

```zig
export fn foo(x: @Vector(32, u8)) @TypeOf(x) {
    return @clz(x);
}
```

The LLVM IR version:

```llvm
define dso_local range(i8 0, 9) <32 x i8> @foo(<32 x i8> %0) local_unnamed_addr {
Entry:
  %1 = tail call range(i8 0, 9) <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %0, i1 false)
  ret <32 x i8> %1
}

declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>, i1 immarg) #1
```

This results in the following output for Zen 4:

```asm
.LCPI0_0:
        .zero   32,24
foo:
        vpmovzxbd       zmm1, xmm0
        vextracti128    xmm0, ymm0, 1
        vpmovzxbd       zmm0, xmm0
        vplzcntd        zmm1, zmm1
        vplzcntd        zmm0, zmm0
        vpmovdb xmm1, zmm1
        vpmovdb xmm0, zmm0
        vinserti128     ymm0, ymm1, xmm0, 1
        vpsubb  ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
        ret
```

LLVM-mca claims this should take ~23 cycles per iteration.

This is pretty unfortunate because if we downgrade to Zen 3, we get:

```asm
.LCPI0_1:
        .zero   32,15
.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
foo:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vpxor   xmm3, xmm3, xmm3
        vpshufb ymm2, ymm1, ymm0
        vpsrlw  ymm0, ymm0, 4
        vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
        vpcmpeqb        ymm3, ymm0, ymm3
        vpshufb ymm0, ymm1, ymm0
        vpand   ymm2, ymm2, ymm3
        vpaddb  ymm0, ymm2, ymm0
        ret
```

LLVM-mca says Zen 3 should be able to compute this in ~11 cycles per iteration.
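
For reference, here is a minimal scalar sketch of what the Zen 3 sequence above appears to compute in each byte lane. The names `nibble_clz` and `clzByteModel` are made up for illustration (they are not from LLVM or the Zig standard library), and the single-argument `@intCast` assumes Zig 0.11 or newer. It models only the arithmetic, not the actual lowering.

```zig
const std = @import("std");

// Leading-zero count of each 4-bit value; same 16 bytes as .LCPI0_2.
const nibble_clz = [16]u8{ 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

// Hypothetical scalar model of one byte lane of the AVX2 sequence.
fn clzByteModel(x: u8) u8 {
    const hi: u8 = x >> 4; // vpsrlw by 4, then vpand with 15
    const lo: u8 = x & 0x0F; // low nibble, as used by the first vpshufb
    const hi_count: u8 = nibble_clz[hi]; // second vpshufb
    // The low-nibble count only contributes when the high nibble is zero
    // (vpcmpeqb against zero, then vpand).
    const lo_count: u8 = if (hi == 0) nibble_clz[lo] else 0;
    return hi_count + lo_count; // vpaddb
}

test "nibble table model agrees with @clz for every byte value" {
    var i: usize = 0;
    while (i < 256) : (i += 1) {
        const x: u8 = @intCast(i);
        try std.testing.expectEqual(@as(u8, @clz(x)), clzByteModel(x));
    }
}
```

Running this with `zig test` checks the model against `@clz` for all 256 inputs. Note that the mask comes from the high nibble being zero (the `vpcmpeqb` on the shifted, masked value), not from the whole byte being zero.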

We can reproduce this functionality in Zig like so:

```zig
export fn foo2(x: @Vector(32, u8)) @TypeOf(x) {
    const vec: @TypeOf(x) = comptime std.simd.repeat(@sizeOf(@TypeOf(x)), [16]u8{4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0});
    return @select(u8, x == @as(@TypeOf(x), @splat(0)), vpshufb(vec, x), @as(@TypeOf(x), @splat(0))) + vpshufb(vec, x >> @splat(4));
}
```

Which gives us functionally equivalent assembly, albeit reordered. LLVM-mca says that we can 10 cycles per iteration with the instructions reordered. Not sure if there is anything to that. [Godbolt link here](https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgAzKVpMGoAF55gpJfWQE8Ayo3QBhVLQCuLBiABMpJwAyeAyYAHKeAEaYxBJcpAAOqAqE9gyuHl4SCUkpAkEh4SxRMQAcVpg2dgJCBEzEBOme3n7WmLapNXUE%2BWGR0bFWtfWNmXEKQ93BvUX9JQCUVqjuxMjsHGgM4wDU4%2BhbAKR6ACJbAAJ4LIn1EPs%2BPru3PnOHAEL7GgCCG9sR7nR2DAOxzOFyuBBud1%2B/2Cj2eejen3eHyogP4qAg6jOADU2kRiBA9H4tu55nMzgAVACe8UwAHkqBiyfsAOwIj5bDlbYiYAjLQGnZC0UyM15IllHMWfTCqMFbFFy1CoHwYkDY3EkAlEklzMmnKk0%2BmMg6spGcrbfAhbABubVVeupdIZqiZwLQlzsbB2BHQADpkixfdyaUxwadkqZHRB7QanTrSAcAKwvLgANn2CaOJJZL2kBj8fjihdIcQ0pFL5bLlfL4rhbLN3N5xH5rVsEBJ8dUQIlwNOTAUUf1ked8bD8SM4I0cet8QUCHcVAiEBtyA7CzOfYHDsNw7OCjHIYgk51Bx8L2ns/ni%2BXHaB2EO2F3%2B/Bkh1osRzIliM%2B8qtM7nC4gWoInELZjEpAgHXjYJ8FWBQ7UHQ0gPoY9o0jJDMCZE1PjNPAqC2KNglcd0LkwCBj2zU0zU5K06i5TAFHcWgCHgrcGWgvBYJdE53AYLAaBCdA33ZKiOX4Yh8I0H0fVOCJCCEPAI0NVDDXYziyQAei2eZjWcPAWWcY06xEqiLS2aDpS7MzeI4%2Bj0xePSMyE4yqO5BimLshyTkOE5cPw8zO3vbytg0VQSknYKtgqJQtnQjzeIs24E3wmS5IUyNlIZdDj00%2BZ027IyRPFSVhJEhs%2BTotyCCczkiq/ErzQEbY2AIBAMAUSzxmIdxbEMyiROlAholRflHloWgrRYH1QpTH0mCtVQEy4HwfT/ecfQiH1Fp8R4oxxWwNRTaRiXmEc9rxCBDvjbVdTOg6ju1ar%2BtUQamzlEa7jGiappKGa5tUZbVqodadtOW78UJK6TrVfbwa1UlofOiHjtrPqqIGoa3rOUbxsm6a/QUJQ9BWi8gY2paShBsGIFTSG11B9V8Rp5GboZ6mU1px7jXy4r6x5cqFAAd0IZAEGShDMqYYCMMwgqqPpmGLvu7TvPvMy8IgN14g9TAvV9ck6mAHlvpmqhMBDZZMCEHkAAkNyhJjgh9ZB4ncH1TfN1z41m%2BatoiAXj2a1r0AUaTsa%2BvG/q24n/3Wzalp29CoOszjItoaKBVQS46BwYhiA1R5nGUOQYoNnktiMZAAGt2oY%2BJZTE89/wiLbYVIVGzXlxG4a41XfI1zOtZI3WfX14hDYIY23bNxtLZtu2/gdhgnZdqePfor2/qeMlA7a0OPpxyfN%2BjtaIgTyX6CTmD6LJKKdYzrP6GwXP87uQvi6Gcfy6YKua/cOuSEtA3X8JMIg%2BATGmO4Cx26ck7hqJm2ouy93VprbWw9R7j0nu7GeVsCC237PbAEy9XZYItgoL2BNCYBx5EHEOpww64x%2BvjQmx9SY%2BnJmfKWl8bIKBvmnO%2Bmts5PzzviAuRcS5jzLhXauOw/71xII3S87DIFt2wsZW%2BiC9APnvvEQRz98S7D
diwCeKCSLKGIMEcEjwACSDAaK0DwHsA2nhGCWggjSLY8Q%2BxKD2EQBRC5VQUQ/OmZwDBHhe2zBlQC58MI1igaojkRV4Rig/BwBYtBOAJl4N4bgvBUCcAAFoWB2EsFYOtbh6D0LwZiHAtBxgQGbLAMQyKkEriARaPo9AlB8CmLgABOPQKYfDMikIYTgkheAsAkBoUs2StCkDyRwXgcFSzVNqaQOAsAYCIBQAPbOZAKD9wfv0UwBAuoMErnwf40Q4IQAiJoXgslmDEEpJwHgpBHl1EpLSCI2hcSvN4G6NgghaQMFoC8mpvAsC/GAM4MQad/mkCwCwYwwBxAQsRXgbk7QbRwXRdKNo7hBoIosRUe5hg8ARGIJ81wWAyWnIuAim0xAIhJEwEcTAyKTD2JMPchYVAjDAAUFiPAmABa0hpNkt5/BBAiDEOwEZ0r5BKDUGS3QcQjA8vMJYexEQ4KQAWKgQejVOAAFpaRbAAEoVDNkoAAYn2S0JqP5l2mgAfUOia7l7hOwmpYM7dw3lTC2OiJU%2BZTLzFYD1c0lsVRvAQCcCMbwcRAhTEKMULIiRkixsTRmnIsaehpv6GMa17RqgTBzcWyoHQJgFr6DEMY5a3BNCyOMLotaZj1oWAoEpqx9BpIyVkslCythapimcyu%2BFcCEHkeUvQcwqm8oWPUpgjTKALFaZIEoPoenMk6RoFMehmQ%2BB%2BhodVYyJltJmUOzgSyQArN5esrZEAkBEDcOQSgBtcW8E/coYwFQhCtQFpKgFuz6DEFCKwNYo7Tk8UrrwTA%2BA8QOP0DIGVohxAKtkIoFQ6h0VqtICwAQ39UBTrxHgl4mBGAfEJagcVjAEMIsI8wNApGSB0eAwRojaAaj4A47sYIN7Bi8d/SEWgAHUBAYRa%2B2gVj0AgE%2BiwX1ogYPnNIALKl8R/n9o4Jk0gszcmcAU0ppgI6ikqYnRAVj4lZ3xiIjosDJ49BcHnaQVZqSl0NP6M09JHBxkEcvXp69iyrB3tc4ulpIBN3bq4LusKB6j0nrPRwUN%2Bn5mCbc3GHzPhB3ooWQuiFcYmXJAcJIIAA%3D%3D)


Upgrading back to Zen 4, `foo2` gives us:

```asm
.LCPI0_2:
        .byte   4
        .byte   3
        .byte   2
        .byte   2
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
.LCPI0_3:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   128
        .byte   64
        .byte   32
        .byte   16
foo2:
        vbroadcasti128  ymm1, xmmword ptr [rip + .LCPI0_2]
        vptestnmb       k1, ymm0, ymm0
        vpshufb ymm2, ymm1, ymm0
        vgf2p8affineqb  ymm0, ymm0, qword ptr [rip + .LCPI0_3]{1to4}, 0
        vpshufb ymm0, ymm1, ymm0
        vpaddb  ymm0 {k1}, ymm0, ymm2
        ret
```

LLVM-mca says this gives us ~12 cycles of latency per iteration, which, I notice, is higher than the ~10 cycles on Zen 3.

Not sure what the best course of action is, but probably one of these things should be done:

1. The latter implementation should be used to implement `@clz(32 x u8)` and `@clz(64 x u8)`.
2. The Zen 3 implementation should be lifted directly to Zen 4.

Thank you!

</pre>