[llvm-bugs] [Bug 34843] New: Suboptimal code generation for __builtin_ctz(ll)

Thu Oct 5 02:55:50 PDT 2017

https://bugs.llvm.org/show_bug.cgi?id=34843

            Bug ID: 34843
           Summary: Suboptimal code generation for __builtin_ctz(ll)
           Product: clang
           Version: 5.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: LLVM Codegen
          Assignee: unassignedclangbugs at nondot.org
          Reporter: gcp at sjeng.org
                CC: llvm-bugs at lists.llvm.org

Right now, when no specific arch target is set, the builtin

__builtin_ctz (and its long and long long variants, __builtin_ctzl and
__builtin_ctzll)

will generate a BSF instruction.

This is suboptimal for AMD machines, which can do a TZCNT much faster than they
can do a BSF. TZCNT is encoded as a BSF with a REP prefix, and CPUs without
TZCNT support ignore that prefix and execute a plain BSF. So it is in fact
"backwards compatible" as long as the different behavior for a 0 input is fine
(TZCNT returns the operand width, BSF leaves the destination undefined). And it
is, because __builtin_ctz has undefined behavior for 0 (which is why it can use
BSF in the first place).
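
To illustrate the zero-input point: code that needs a defined result for 0
has to guard the call itself, which is what leaves the compiler free to pick
either encoding for the builtin. A minimal sketch, where the helper name
ctz_or_width is hypothetical and returning the bit width for 0 is just one
possible convention:

```c
#include <limits.h>

/* Hypothetical helper, not part of any compiler or library.
 * __builtin_ctz(0) is undefined, so the zero case is handled explicitly.
 * Returning the bit width for 0 mirrors what TZCNT does in hardware,
 * but that choice is an assumption of this sketch. */
static unsigned ctz_or_width(unsigned x)
{
    if (x == 0)
        return (unsigned)(sizeof(unsigned) * CHAR_BIT);
    return (unsigned)__builtin_ctz(x);
}
```

With this guard in place, ctz_or_width(8u) is 3 and ctz_or_width(0u) is the
width of unsigned in bits, regardless of whether BSF or REP BSF/TZCNT is
emitted for the builtin.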

On Intel hardware, either way is equally fast, so for a generic target it makes
sense to deal with the AMD case and encode the intrinsic as REP BSF/TZCNT.

GCC 4.8 and later, at least, are able to do this optimization and generate a
REP BSF for their generic target. Clang fails to do so. (It does generate TZCNT
with -march=znver1.)

Example snippet:
https://godbolt.org/g/eXU6xf
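
For readers who cannot open the link, a reproduction along the following lines
should show the same difference. This is a sketch and not the content of the
linked snippet: the function names are hypothetical, and the -O2 level in the
commands is an assumption.

```c
/* Hypothetical reproduction; function names are illustrative only.
 *
 * Compare the assembly from, for example:
 *   clang -O2 -S ctz.c                 (generic target: BSF, per this report)
 *   clang -O2 -march=znver1 -S ctz.c   (TZCNT, per this report)
 *   gcc   -O2 -S ctz.c                 (REP BSF, per this report)
 *
 * Callers must never pass 0: the builtins are undefined for that input. */
int ctz_u(unsigned x)             { return __builtin_ctz(x); }
int ctz_ul(unsigned long x)       { return __builtin_ctzl(x); }
int ctz_ull(unsigned long long x) { return __builtin_ctzll(x); }
```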

Of note in this snippet is also that newer GCC adds an XOR ESI, ESI before the
REP BSF, so there may be a false dependency issue on some CPUs.
