<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/124993">124993</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Suboptimal code generation for vectorized version of llvm.ctlz() for int64 on x86-64
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
aneshlya
</td>
</tr>
</table>
<pre>
LLVM generates suboptimal code for `llvm.ctlz()` on int64 elements across x86-64 instruction set extensions below AVX512 (SSE4 through AVX2). Performance measurements indicate that extracting the individual 64-bit values from the vector register and applying scalar `lzcnt` to each yields a 25% improvement on AVX2 and a 124% improvement on SSE4, compared to the vectorized `llvm.ctlz` implementation.
Please see the example here: https://ispc.godbolt.org/z/EEErrednx
</pre>
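<p>
For reference, a minimal standalone sketch (not taken from the issue; function and variable names are hypothetical) of the two approaches being compared: a plain loop that the vectorizer can turn into the generic <code>llvm.ctlz.v4i64</code> expansion, and the scalar-extraction workaround that applies <code>lzcnt</code> per 64-bit lane. Before AVX512CD (which adds <code>vplzcntq</code>), the vectorized path is lowered to a long bit-manipulation sequence, which is what the measurements above compare against.
</p>
<pre>
// Build with clang/gcc, e.g. -O2 -mavx2 -mlzcnt (or -march=haswell).
#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Reference path: a scalar loop the vectorizer may widen into
// llvm.ctlz.v4i64, which lacks a direct instruction before AVX512CD.
// __builtin_clzll is undefined for 0, hence the guard.
static void ctlz_loop(const uint64_t *in, uint64_t *out, size_t n) {
    for (size_t i = 0; i &lt; n; ++i)
        out[i] = in[i] ? (uint64_t)__builtin_clzll(in[i]) : 64;
}

// Workaround path: extract each 64-bit lane from the ymm register and
// use the scalar lzcnt instruction, then reassemble the vector.
static inline __m256i ctlz_epi64_scalar(__m256i v) {
    uint64_t r0 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 0));
    uint64_t r1 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 1));
    uint64_t r2 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 2));
    uint64_t r3 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 3));
    return _mm256_set_epi64x((int64_t)r3, (int64_t)r2,
                             (int64_t)r1, (int64_t)r0);
}
</pre>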