[LLVMdev] Enabling the vectorizer for -Os

Renato Golin renato.golin at linaro.org
Wed Jun 5 02:15:40 PDT 2013


On 5 June 2013 04:26, Nadav Rotem <nrotem at apple.com> wrote:

> I would like to start a discussion about enabling the loop vectorizer by
> default for -Os. The loop vectorizer can accelerate many workloads and
> enabling it for -Os and -O2 has obvious performance benefits.


Hi Nadav,

As it stands, O2 is very similar to O3, with O3 running a few more
aggressive optimizations, including the vectorizers. I think this is a
good rationale: at O3 I expect the compiler to throw everything it has at
the problem. O2 is somewhat more conservative, and people normally use it
when they want more stability in the code and results (regarding FP,
undefined behaviour, etc.). I also use it for finding bugs in the compiler
that are introduced by O3, and making the two levels more similar won't
help with that either. I have yet to see a good reason to enable the
vectorizer by default at O2.

Code size is a different matter, though. I agree that vectorized code can
be as small as (if not smaller than) scalar code and much more efficient,
so there is a clear win in turning it on by default under those
circumstances.
But there are catches that we need to make sure are well understood before
we do so.



> First, to vectorize some loops we have to keep the original loop around in
> order to handle the last few iterations.


Or the runtime condition under which it can be vectorized may not hold, in
which case you have to fall back to the original loop.
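
To make the shape concrete, here is a hand-written sketch of what such a
transformation ends up looking like (my own illustration with an invented
saxpy example, not actual vectorizer output; the alias check and the width
of four are just assumptions for the example):

  #include <cstddef>
  #include <cstdint>

  // The vector body handles whole groups of four iterations; the
  // original scalar loop is kept to run the tail, or everything if the
  // runtime guard fails.
  void saxpy(float *a, const float *b, float k, std::size_t n) {
    std::size_t i = 0;
    std::uintptr_t A = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t B = reinterpret_cast<std::uintptr_t>(b);
    bool Overlap = A < B + n * sizeof(float) && B < A + n * sizeof(float);
    if (!Overlap) {
      for (; i + 4 <= n; i += 4) {
        // stands in for one <4 x float> vector iteration
        a[i + 0] += b[i + 0] * k;
        a[i + 1] += b[i + 1] * k;
        a[i + 2] += b[i + 2] * k;
        a[i + 3] += b[i + 3] * k;
      }
    }
    for (; i < n; ++i)   // original scalar loop, kept for the tail
      a[i] += b[i] * k;
  }

So both the guard and the scalar copy cost code size, even when the vector
body itself is no bigger than the scalar one.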


> Second, on x86 and possibly other targets, the encoding of vector
> instructions takes more space.
>

This may be a problem, and maybe the solution is to build a "SizeCostTable"
and do the same as we did for the CostTable. Most targets would just
return 1, but some would have to override that and make an educated guess.

However, on ARM, NEON and VFP instructions are 32 bits wide (one word, or
two half-words), while Thumb instructions can be 16-bit or 32-bit. So you
have to model not just how big the vector instructions will be, but also
how big the scalar instructions would be, and not all Thumb instructions
are the same size, which makes matters much harder.

In that sense, the SizeCostTable would possibly have to default to 2
(half-words) for most targets, and *also* account for scalar code, not
just vector code, in a special way.
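
Something along these lines is what I have in mind (purely hypothetical;
none of these names exist in the tree, and the shape is only loosely
modelled on the existing per-target cost tables):

  // Sizes are expressed in 16-bit half-words so that a Thumb-capable
  // target can say a scalar add costs 1 while a 32-bit NEON/VFP
  // instruction costs 2.
  struct SizeCostTblEntry {
    unsigned Opcode;     // e.g. an ISD opcode
    unsigned VT;         // value type the entry applies to
    unsigned HalfWords;  // encoding size in 16-bit units
  };

  // Most targets would not provide a table at all and would fall back
  // to a flat default of one 32-bit instruction per operation.
  static const unsigned DefaultSizeCostInHalfWords = 2;

  unsigned getInstrSizeCost(const SizeCostTblEntry *Tbl, unsigned Len,
                            unsigned Opcode, unsigned VT) {
    for (unsigned i = 0; i < Len; ++i)
      if (Tbl[i].Opcode == Opcode && Tbl[i].VT == VT)
        return Tbl[i].HalfWords;
    return DefaultSizeCostInHalfWords;
  }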


> I measured the effects of vectorization on performance and binary size
> using -Os. I measured the performance on a Sandybridge and compiled our
> test suite using -mavx -f(no)-vectorize -Os.  As you can see in the
> attached data there are many workloads that benefit from vectorization.
>  Not as much as vectorizing with -O3, but still a good number of programs.
>  At the same time the code growth is minimal.


It would be good to get the performance improvements *and* the size
increases side-by-side in Perf.

Also, our test-suite is famous for being too noisy, so I'd run it at least
20x each way and compare the averages (keeping an eye on the std.dev) to
check whether the results are actually meaningful.

Again, it would be good to have that kind of analysis in Perf, warning
only if the increase/decrease is statistically meaningful.
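
By "statistically meaningful" I mean something along these lines (a rough
sketch of the idea, not what Perf does today):

  #include <cmath>
  #include <vector>

  struct Stats { double Mean, StdDev; };

  // Assumes a non-empty sample set.
  static Stats summarize(const std::vector<double> &Samples) {
    double Sum = 0.0;
    for (double S : Samples)
      Sum += S;
    double Mean = Sum / Samples.size();
    double Var = 0.0;
    for (double S : Samples)
      Var += (S - Mean) * (S - Mean);
    return { Mean, std::sqrt(Var / Samples.size()) };
  }

  // Only flag a change when the difference of the means is larger than
  // the combined noise of the two sample sets.
  bool isMeaningfulChange(const std::vector<double> &Before,
                          const std::vector<double> &After) {
    Stats B = summarize(Before), A = summarize(After);
    return std::fabs(A.Mean - B.Mean) > (A.StdDev + B.StdDev);
  }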


> Most workloads are unaffected and the total code growth for the entire test
> suite is 0.89%.  Almost all of the code growth comes from the TSVC test
> suite which contains a large number of large vectorizable loops.  I did not
> measure the compile time in this batch but I expect to see an increase in
> compile time in vectorizable loops because of the time we spend in codegen.
>

I was expecting small growth because of how conservative our vectorizer is.
Less than 1% is acceptable, in my view. For ultimate code size, users
should use -Oz, which should never have any vectorizer enabled by default
anyway.

A few considerations on embedded systems:

* A 66% increase in size on an embedded system is not cheap. But LLVM
hasn't been focusing on that use case so far, and we still have -Oz, which
does a pretty good job at compressing code (compared to -O3), so even if
we do have existing embedded users shaving off bytes, the change in their
build systems would be minimal.
* Most embedded chips have no vector units, at most single-precision FP
units or the like, so vectorization isn't going to make much of a
difference on those architectures anyway.

So, in a nutshell, I agree that -Os could have the vectorizer enabled by
default, but I have yet to see a good reason to do that for -O2.

cheers,
--renato

