<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/124993">124993</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Suboptimal code generation for vectorized version of llvm.ctlz() for int64 on x86-64
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
aneshlya
</td>
</tr>
</table>
<pre>
LLVM generates suboptimal code for `llvm.ctlz()` on int64 elements across x86-64 instruction set extensions below AVX512 (SSE4 through AVX2). Performance measurements indicate that extracting the individual 64-bit values from the vector register and applying scalar `lzcnt` to each yields a 25% improvement on AVX2 and a 124% improvement on SSE4, compared to the vectorized `llvm.ctlz` implementation.
Please see the example here: https://ispc.godbolt.org/z/EEErrednx
</pre>
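<p>
For reference, a minimal standalone sketch (not taken from the issue; function and variable names are hypothetical) of the two approaches being compared: a plain loop that the vectorizer can turn into the generic <code>llvm.ctlz.v4i64</code> expansion, and the scalar-extraction workaround that applies <code>lzcnt</code> per 64-bit lane. Before AVX512CD (which adds <code>vplzcntq</code>), the vectorized path is lowered to a long bit-manipulation sequence, which is what the measurements above compare against.
</p>
<pre>
// Build with clang/gcc, e.g. -O2 -mavx2 -mlzcnt (or -march=haswell).
#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Reference path: a scalar loop the vectorizer may widen into
// llvm.ctlz.v4i64, which lacks a direct instruction before AVX512CD.
// __builtin_clzll is undefined for 0, hence the guard.
static void ctlz_loop(const uint64_t *in, uint64_t *out, size_t n) {
    for (size_t i = 0; i &lt; n; ++i)
        out[i] = in[i] ? (uint64_t)__builtin_clzll(in[i]) : 64;
}

// Workaround path: extract each 64-bit lane from the ymm register and
// use the scalar lzcnt instruction, then reassemble the vector.
static inline __m256i ctlz_epi64_scalar(__m256i v) {
    uint64_t r0 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 0));
    uint64_t r1 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 1));
    uint64_t r2 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 2));
    uint64_t r3 = _lzcnt_u64((uint64_t)_mm256_extract_epi64(v, 3));
    return _mm256_set_epi64x((int64_t)r3, (int64_t)r2,
                             (int64_t)r1, (int64_t)r0);
}
</pre>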