<div dir="ltr">On 5 June 2013 04:26, Nadav Rotem <span dir="ltr"><<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> I would like to start a discussion about enabling the loop vectorizer by
> default for -Os. The loop vectorizer can accelerate many workloads and
> enabling it for -Os and -O2 has obvious performance benefits.
Hi Nadav,

As it stands, -O2 is very similar to -O3, with a few more aggressive
optimizations running on top, including the vectorizers. I think this is a
good rationale: at -O3 I expect the compiler to throw everything it has at
the problem, while -O2 is somewhat more conservative, and people normally
use it when they want more stability of code and results (regarding FP,
undefined behaviour, etc.). I also use -O2 for finding compiler bugs that
are introduced at -O3, and making the two levels more similar won't help
with that either. I have yet to see a good reason to enable the vectorizer
by default at -O2.
Code size is a different matter, though. I agree that vectorized code can
be as small as (if not smaller than) scalar code and much more efficient,
so there is a clear win in turning it on by default under those
circumstances. But there are catches that we need to make sure are well
understood before we do so.
> First, to vectorize some loops we have to keep the original loop around
> in order to handle the last few iterations.
Or when the runtime conditions under which the loop can be vectorized do
not hold, in which case you also have to run the original.
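To make that duplication concrete, here is a rough C++ sketch (my own
naming, not the vectorizer's actual output) of the shape a simple loop
takes after vectorization: a runtime check that can fall back to the
untouched scalar loop, a vector body, and a scalar epilogue for the tail
iterations:

  #include <cstddef>

  // The original scalar loop: this always stays in the binary.
  static void add_scalar(float *a, const float *b, const float *c,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  }

  // Stand-in for the vectorizer's runtime alias checks.
  static bool overlaps(const float *p, const float *q, std::size_t n) {
    return p < q + n && q < p + n;
  }

  // The shape the loop takes after vectorization (width 4 assumed).
  void add(float *a, const float *b, const float *c, std::size_t n) {
    if (overlaps(a, b, n) || overlaps(a, c, n))
      return add_scalar(a, b, c, n);       // check failed: run the original

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)             // vector body
      for (std::size_t j = 0; j < 4; ++j)  // stands in for one vector add
        a[i + j] = b[i + j] + c[i + j];

    for (; i < n; ++i)                     // scalar epilogue: the "last
      a[i] = b[i] + c[i];                  // few iterations" above
  }

Both copies of the loop end up in the final image, which is where the size
cost comes from.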
> Second, on x86 and possibly other targets, the encoding of vector
> instructions takes more space.

This may be a problem, and maybe the solution is to build a
"SizeCostTable" and do the same as we did for the CostTable. Most targets
would just return 1, but some would have to override it and guess.
However, on ARM, NEON and VFP instructions are always 32 bits wide (one
word or two half-words), while Thumb instructions can be either 16 or 32
bits. So you don't just have to model how big the vector instructions will
be, but also how big the scalar instructions would be, and since not all
Thumb instructions are the same size, that makes matters much harder.
In that sense, the SizeCostTable would possibly have to default to 2
(half-words) for most targets, and *also* handle scalar code, not just
vector code, in a special way.
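As a minimal sketch of what I mean (nothing like this exists in trunk, the
names are mine, and the sizes are guesses counted in 16-bit half-words):

  // Hypothetical SizeCostTable, loosely modelled on the existing
  // CostTable idea. Sizes are in 16-bit half-words, so a fixed-width
  // 32-bit ISA returns 2 for everything.
  struct SizeCostTable {
    virtual unsigned scalarInstrSize() const { return 2; } // 32-bit default
    virtual unsigned vectorInstrSize() const { return 2; } // 32-bit default
    virtual ~SizeCostTable() = default;
  };

  // Thumb-2 has to model the scalar side too: scalar instructions can be
  // one or two half-words, while NEON/VFP are always two.
  struct ThumbSizeCostTable : SizeCostTable {
    unsigned scalarInstrSize() const override { return 1; } // optimistic
    unsigned vectorInstrSize() const override { return 2; }
  };

  // x86 encodings are variable-length, and VEX-prefixed vector
  // instructions tend to be longer than their scalar counterparts, so it
  // can only guess.
  struct X86SizeCostTable : SizeCostTable {
    unsigned scalarInstrSize() const override { return 2; } // ~3-4 bytes
    unsigned vectorInstrSize() const override { return 3; } // ~5-6 bytes
  };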
> I measured the effects of vectorization on performance and binary size
> using -Os. I measured the performance on a Sandybridge and compiled our
> test suite using -mavx -f(no)-vectorize -Os. As you can see in the
> attached data there are many workloads that benefit from vectorization.
> Not as much as vectorizing with -O3, but still a good number of
> programs. At the same time the code growth is minimal.
It would be good to get the performance improvements *and* the size
increases side by side in Perf.

Also, our test-suite is famous for being too noisy, so I'd run it at least
20 times for each configuration and compare the averages (keeping an eye
on the standard deviation) to establish whether the results are meaningful
or not.
Again, it would be good to have that kind of analysis in Perf, and to only
warn when the increase/decrease is statistically meaningful.
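As a sketch of the kind of check I have in mind (the names and the
two-sigma threshold are my own choices, not anything Perf does today):

  #include <cmath>
  #include <cstdio>
  #include <vector>

  struct Stats { double mean, stddev; };

  static Stats summarize(const std::vector<double> &runs) {
    double sum = 0;
    for (double r : runs) sum += r;
    double mean = sum / runs.size();
    double var = 0;
    for (double r : runs) var += (r - mean) * (r - mean);
    return {mean, std::sqrt(var / (runs.size() - 1))};
  }

  // Crude significance test: only flag a change when the means differ by
  // more than two combined standard deviations (roughly 95%, assuming
  // normally distributed noise).
  static bool meaningful(const Stats &before, const Stats &after) {
    double diff = std::fabs(after.mean - before.mean);
    double noise = std::sqrt(before.stddev * before.stddev +
                             after.stddev * after.stddev);
    return diff > 2 * noise;
  }

  int main() {
    // e.g. 20 timings of the same test, with and without -fvectorize
    std::vector<double> before = {1.02, 0.99, 1.01 /* ... */};
    std::vector<double> after  = {0.97, 0.96, 0.98 /* ... */};
    if (meaningful(summarize(before), summarize(after)))
      std::puts("statistically meaningful change");
  }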
> Most workloads are unaffected and the total code growth for the entire
> test suite is 0.89%. Almost all of the code growth comes from the TSVC
> test suite which contains a large number of large vectorizable loops. I
> did not measure the compile time in this batch but I expect to see an
> increase in compile time in vectorizable loops because of the time we
> spend in codegen.
I was expecting small growth, given how conservative our vectorizer is.
Less than 1% is acceptable, in my view. For ultimate code size, users
should use -Oz, which should never have any vectorizer enabled by default
anyway.
A few considerations on embedded systems:

* A 66% increase in size on an embedded system is not cheap. But LLVM
hasn't been focusing on that use case so far, and we still have -Oz, which
does a pretty good job of compressing code (compared to -O3), so even if
we do have existing embedded users shaving off bytes, the change to their
build systems would be minimal.
* Most embedded chips have no vector units (at most a single-precision
FPU or the like), so vectorization isn't going to make a difference on
those architectures anyway.

So, in a nutshell, I agree that -Os could have the vectorizer enabled by
default, but I have yet to see a good reason to do that for -O2.
cheers,
--renato