[PATCH][X86] __builtin_ctz/clz sometimed defined for zero input

Robinson, Paul Paul_Robinson at playstation.sony.com
Sun Oct 26 17:56:42 PDT 2014

I don't know about the "public" contract, but Clang's own lzcntintrin.h uses __builtin_clz without a runtime zero check, protected by the __LZCNT__ preprocessor define.  So Clang's own headers are assuming that __builtin_clz will be treated as defined for zero input on these machines.  That makes my patch a correctness fix, not 'just' an optimization.

If you use Clang's intrinsics (either the ones from Intrin.h or the ones from lzcntintrin.h) you shouldn't need to add your own zero checks; but that's the public contract for the intrinsics, not the builtins.

If we leave __builtin_clz defined as-is, you should always add your own zero check; on appropriate machines sometimes Clang will be able to tell and optimize it away.  But that's target-dependent and I don't know that it would be all that smart to codify that target-dependent behavior as part of the public contract.

From: Sean Silva [mailto:chisophugis at gmail.com]
Sent: Friday, October 24, 2014 5:46 PM
To: Arthur O'Dwyer
Cc: Robinson, Paul; cfe-commits at cs.uiuc.edu
Subject: Re: [PATCH][X86] __builtin_ctz/clz sometimed defined for zero input

On Fri, Oct 24, 2014 at 4:53 PM, Arthur O'Dwyer <arthur.j.odwyer at gmail.com<mailto:arthur.j.odwyer at gmail.com>> wrote:
On Fri, Oct 24, 2014 at 3:55 PM, Sean Silva <chisophugis at gmail.com<mailto:chisophugis at gmail.com>> wrote:
> What is the harm of documenting it? By defining this behavior we are
> pledging to forever support it.

Exactly. That's harm enough, isn't it?

It looks like clang already has precedent for doing this, and a corresponding infrastructure for doing so, so it shouldn't be too problematic.

>> This allows us to optimize
>>   unsigned foo(unsigned x) { return (x ? __builtin_clz(x) : 32) }
>> into a single LZCNT instruction on a BMI-capable X86, if we choose.

I'm no expert, but isn't this kind of peephole optimization supposed
to be handled by the LLVM (backend) side of things, not the Clang

It's not an optimization that is being talked about here. It's a change of semantic meaning to define a previously undefined situation. Presumably it's a win for the user that thinks that __builtin_clz means "the best clz-like instruction available for the target", rather than the independently defined meaning given in the docs.

-- Sean Silva

The behavior of __builtin_clz(0) is undefined, so no portable program
can use it. Sure, it might happen to work on your machine, but then
you switch to a different architecture, or even a slightly older x86,
or a year from now Intel introduces an *even faster* bit-count
instruction that yields the wrong value for 0... and suddenly your
program breaks, and you have to waste time tracking down the bug.

Therefore, portable programs WILL use

   unsigned foo(unsigned x) { return (x ? __builtin_clz(x) : 32) }

Therefore, it would be awesome if the LLVM backend for these
particular Intel x86 machines would generate the most-optimized LZCNT
instruction sequence for "foo". This seems relatively easy and
shouldn't involve any changes to the Clang front-end, and *certainly*
not to the documentation.

Then you get the best of both worlds — old portable code continues to
work (but gets faster), and newly written code continues to be

Andrea wrote:
>> > My concern is that your suggested approach would force people to
>> > always guard calls to __builtin_ctz/__builtin_clz against zero.
>> > From a customer point of view, the compiler knows exactly if ctz and
>> > clz is defined on zero. It is basically pushing the problem on the
>> > customer by forcing them to guard all the calls to ctz/clz against
>> > zero. We've already had a number of customer queries/complaints about
>> > this and I personally don't think it is unreasonable to have ctz/clz
>> > defined on zero on our target (and other x86 targets where the
>> > behavior on zero is clearly defined).

Sounds like a good argument in favor of introducing a new builtin,
spelled something like "__builtin_lzcnt", that does what the customer
wants. It could be implemented internally as (x ? __builtin_clz(x) :
32), and lowered to a single instruction if possible by the LLVM
backend for your target.

my $.02,

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20141027/8338f93f/attachment.html>

More information about the cfe-commits mailing list