<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 24, 2014 at 4:53 PM, Arthur O'Dwyer <span dir="ltr"><<a href="mailto:arthur.j.odwyer@gmail.com" target="_blank">arthur.j.odwyer@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">On Fri, Oct 24, 2014 at 3:55 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>

><br>

> What is the harm of documenting it? By defining this behavior we are<br>

> pledging to forever support it.<br>

<br>

</span>Exactly. That's harm enough, isn't it?<br></blockquote><div><br></div><div>It looks like clang already has precedent for this, and corresponding infrastructure to support it, so it shouldn't be too problematic.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><br>

<br>

>> This allows us to optimize<br>

>>   unsigned foo(unsigned x) { return (x ? __builtin_clz(x) : 32); }<br>

>> into a single LZCNT instruction on a BMI-capable X86, if we choose.<br>

<br>

</span>I'm no expert, but isn't this kind of peephole optimization supposed<br>

to be handled by the LLVM (backend) side of things, not the Clang<br>

side?<br></blockquote><div><br></div><div>This isn't about an optimization. It's a semantic change that defines behavior that was previously undefined. Presumably it's a win for the user who thinks that __builtin_clz means "the best clz-like instruction available for the target", rather than the independently defined meaning given in the docs.</div><div><br></div><div><div>-- Sean Silva</div><div> </div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

The behavior of __builtin_clz(0) is undefined, so no portable program<br>

can use it. Sure, it might happen to work on your machine, but then<br>

you switch to a different architecture, or even a slightly older x86,<br>

or a year from now Intel introduces an *even faster* bit-count<br>

instruction that yields the wrong value for 0... and suddenly your<br>

program breaks, and you have to waste time tracking down the bug.<br>

<br>

Therefore, portable programs WILL use<br>

<span class=""><br>

   unsigned foo(unsigned x) { return (x ? __builtin_clz(x) : 32); }<br>

<br>

</span>Therefore, it would be awesome if the LLVM backend for these<br>

particular Intel x86 machines would generate the most-optimized LZCNT<br>

instruction sequence for "foo". This seems relatively easy and<br>

shouldn't involve any changes to the Clang front-end, and *certainly*<br>

not to the documentation.<br>

<br>

Then you get the best of both worlds — old portable code continues to<br>

work (but gets faster), and newly written code continues to be<br>

portable.<br>

<span class=""><br>

Andrea wrote:<br>

>> > My concern is that your suggested approach would force people to<br>

>> > always guard calls to __builtin_ctz/__builtin_clz against zero.<br>

>> > From a customer point of view, the compiler knows exactly if ctz and<br>

>> > clz are defined on zero. It is basically pushing the problem onto the<br>

>> > customer by forcing them to guard all the calls to ctz/clz against<br>

>> > zero. We've already had a number of customer queries/complaints about<br>

>> > this and I personally don't think it is unreasonable to have ctz/clz<br>

>> > defined on zero on our target (and other x86 targets where the<br>

>> > behavior on zero is clearly defined).<br>

<br>

</span>Sounds like a good argument in favor of introducing a new builtin,<br>

spelled something like "__builtin_lzcnt", that does what the customer<br>

wants. It could be implemented internally as (x ? __builtin_clz(x) :<br>

32), and lowered to a single instruction if possible by the LLVM<br>

backend for your target.<br>
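<br>
A minimal sketch of that guarded form, for illustration only. The helper names lzcnt_u and tzcnt_u are hypothetical (they are not existing builtins), and the zero case returns the bit width of unsigned rather than a hard-coded 32, which is an assumption about what the caller wants:<br>

```c
#include <limits.h>

/* Hypothetical helper, not a compiler builtin: count leading zeros,
 * with a defined result (the bit width of unsigned) for x == 0.
 * __builtin_clz(0) itself is undefined, hence the explicit guard. */
static unsigned lzcnt_u(unsigned x) {
    return x ? (unsigned)__builtin_clz(x)
             : (unsigned)(sizeof x * CHAR_BIT);
}

/* Hypothetical counterpart for trailing zeros; __builtin_ctz(0) is
 * undefined as well, so it gets the same guard. */
static unsigned tzcnt_u(unsigned x) {
    return x ? (unsigned)__builtin_ctz(x)
             : (unsigned)(sizeof x * CHAR_BIT);
}
```

Whether this collapses to a single instruction depends on the target and on the backend recognizing the pattern; the source code stays portable either way.<br>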

<br>

my $.02,<br>

–Arthur<br>

</blockquote></div><br></div></div>