[LLVMdev] Enabling the vectorizer for -Os

Chandler Carruth chandlerc at google.com
Thu Jun 6 02:07:09 PDT 2013


On Wed, Jun 5, 2013 at 5:51 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Hi,
>
> Thanks for the feedback.  I think that we agree that vectorization on -Os
> can benefit many programs.
>

FWIW, I don't yet agree.

Your tables show many programs growing in code size by over 20%. While
there are associated performance improvements, it isn't clear that this is
a good tradeoff. Historically, optimizations which gain performance as a
direct result of growing code size have *not* been an acceptable tradeoff
in -Os.

From Owen's email, a characterization I agree with:
"My understanding is that -Os is intended to be
optimized-without-sacrificing-code-size."

The way I would phrase the difference between -Os and -Oz is similar: with
-Os we don't *grow* the code size significantly even if it gives
significant performance gains, whereas with -Oz we *shrink* the code size
even if it means significant performance loss.

Neither of these concepts for -Os would seem to argue for running the
vectorizer given the numbers you posted.


> Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2
> to increase the likelihood of improving the performance?  We have very few
> regressions on -O3 as is and with better cost models I believe that we can
> bring them close to zero, so I am not sure if it can help that much.
> Renato, I prefer not to estimate the encoding size of instructions. We know
> that vector instructions take more space to encode. Will knowing the exact
> number help us in making a better decision? I don’t think so. On modern
> processors when running vectorizable loops, the code size of the vector
> instructions is almost never the bottleneck.
>

That has specifically not been my experience when dealing with
significantly larger and more complex application benchmarks.

The tradeoffs you show in your numbers for -Os are actually exactly what I
would expect for -O2: a willingness to grow code size (and compilation
time) in order to get performance improvements. A quick eye-balling of the
two tables seemed to show most of the size growth had associated
performance growth. This, to me, is a good early indicator that the mode
the vectorizer is running in for your -Os numbers is what we should look
at enabling for -O2.

That said, I would like to see benchmarks from a more diverse set of
applications than the nightly test suite. ;] I don't have a lot of faith in
it being representative. I'm willing to contribute some that I care about
(given enough time to collect the data), but I'd really like for other
folks with larger codebases and applications to measure code size and
performance artifacts as well.
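
For the code-size half of that data, something as simple as comparing
section sizes between two builds of the same sources would do. A sketch
(foo.c stands in for whatever translation unit is being measured; llvm-size
can be substituted for binutils size):

```shell
# Build the same file with and without the vectorizer and compare the
# size of the text section.  "-fvectorize" is Clang's switch for the
# loop vectorizer; everything else here is standard.
clang -Os -fno-vectorize -c foo.c -o foo_novec.o
clang -Os -fvectorize    -c foo.c -o foo_vec.o
size foo_novec.o foo_vec.o
```

The performance half still requires running real benchmarks, of course;
this only makes the size numbers easy to reproduce consistently across
codebases.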

In order to do this, and ensure we are all measuring the same thing, I
think it would be useful to have in Clang flag sets that correspond to the
various modes you are proposing. I think they are:

1) -Os + minimal-vectorize (no unrolling, etc)
2) -O2 + minimal-vectorize
3) -O2 + -fvectorize (I think? maybe you have a more specific flag here?)
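
Concretely, the three configurations might be invoked roughly as follows.
This is only a sketch: -fvectorize exists in Clang today, but the
"minimal vectorize" mode has no dedicated driver flag yet, so the
unroll-suppressing option shown is an assumption about how it might be
spelled rather than a settled interface:

```shell
# 1) -Os plus minimal vectorization (no unrolling inside the vectorizer).
#    Forcing the vectorizer's unroll factor to 1 via an internal LLVM
#    option is an assumed stand-in for a real driver flag.
clang -Os -fvectorize -mllvm -force-vector-unroll=1 -c foo.c

# 2) -O2 plus the same minimal vectorization.
clang -O2 -fvectorize -mllvm -force-vector-unroll=1 -c foo.c

# 3) -O2 plus the full vectorizer.
clang -O2 -fvectorize -c foo.c
```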

Does that make sense to you and others?
-Chandler

PS: As a side note, I would personally really like to revisit my proposal
to write down what we mean for each optimization level as precisely as we
can (acknowledging that this is not very precise; it will always be a
judgement call). I think it would help these discussions stay on track.

