<div dir="ltr">All I have to say is *wow*. The vectorizer performs *remarkably* better now than it did the last time I benchmarked it. I'm stunned.<div><br></div><div>I measured -O2 and -Os, as well as -march=x86-64 and -march=corei7-avx. My hope with the latter two was to cover both worst-case and best-case in terms of the quality of the vector ISA available.</div>

<div><br></div><div>First, binary size growth. This is measured on average across a reasonably wide selection of binaries including large servers, video codecs, image processing, etc.</div><div><br></div><div>O2, x86-64: 1% larger w/ vectorizer</div>

<div>O2, corei7-avx: 1.2% larger</div><div>Os, x86-64: 0.1% larger</div><div>Os, corei7-avx: &lt; 0.1% larger</div><div><br></div><div>This is incredibly impressive IMO. =]</div><div><br></div><div>The performance numbers are also pretty good. There are a couple of minor regressions and only one significant one. That one happens to be open source: <a href="https://code.google.com/p/snappy/source/browse/trunk/snappy.cc">https://code.google.com/p/snappy/source/browse/trunk/snappy.cc</a>. It slows down because the vectorizer vectorizes a cold loop, which then gets inlined and blocks subsequent inlining. (Many thanks to Ben Kramer for pointing out the cause so quickly for me.) But there are a lot of potential solutions to this problem:</div>

<div><br></div><div>1) vectorize after inlining -- this has some problems (code growth mostly) but we might be able to solve them.</div><div>2) mark the cold path as cold so the optimizer is aware of it (I tested this and it seems to work, but I'm still experimenting)</div>

<div>3) rewrite this part of snappy to be fundamentally better (the code as it is doesn't make a lot of sense to me, but I'm not an expert on it and will need time to figure out the best way to solve the issue)</div>
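<div><br></div><div>A minimal sketch of what option 2 could look like, assuming the GCC/Clang <code>cold</code> and <code>noinline</code> function attributes and <code>__builtin_expect</code>. The function names and the 16-byte threshold are hypothetical and are not taken from snappy; this is an illustration of the idea, not the change that was tested:</div>

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical slow-path helper (not snappy code). The `cold` attribute
// tells GCC/Clang that this function is rarely executed, and `noinline`
// keeps its loop, vectorized or not, out of the callers' bodies.
__attribute__((cold, noinline))
static void CopyBytesSlow(const char* src, char* dst, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    dst[i] = src[i];
  }
}

// Hypothetical hot-path wrapper. __builtin_expect hints that the short
// case is the common one, so the call to the cold helper is treated as
// unlikely and the wrapper itself stays small and cheap to inline.
static inline void CopyBytes(const char* src, char* dst, std::size_t len) {
  if (__builtin_expect(len <= 16, 1)) {
    std::memcpy(dst, src, len);  // common case: short copy
    return;
  }
  CopyBytesSlow(src, dst, len);  // rare case: long copy
}
```

<div>Whether hints like these are enough to keep the inliner's decisions intact depends on the compiler version and the surrounding code, so any such change would need to be benchmarked.</div>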

<div><br></div><div>I'm actually happy with any of the 3, although #2 isn't terribly satisfying. But even if that's the result, I can live with it.</div><div><br></div><div>So essentially, I think you should turn the vectorizer on completely. What's left seems very much like small, isolated issues.</div>

<div><br></div><div>Thanks for driving this whole thing and giving me time to do some evaluation. I'm really thrilled by the result.</div><div>-Chandler</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">

On Fri, Jun 14, 2013 at 11:53 AM, Chandler Carruth <span dir="ltr">&lt;<a href="mailto:chandlerc@google.com" target="_blank">chandlerc@google.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<p dir="ltr">Sorry for the delays here. I am running our benchmark suite and will have data in a day or so.</p><div class="HOEnZb"><div class="h5">

<div class="gmail_quote">On Jun 13, 2013 9:40 PM, "Nadav Rotem" &lt;<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Hi,<br>

<br>

Last week I wrote to llvm-dev and presented data showing that enabling the vectorizer at -Os can improve the performance of many workloads with negligible effect on code size. I also added a command line switch to make it easier for people to benchmark the vectorizer at -Os directly from clang without changing LLVM. Has anyone done any benchmarks on -Os + vectorization?<br>


<br>

Thanks,<br>

Nadav<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</blockquote></div>

</div></div></blockquote></div><br></div>