[llvm-dev] [RFC][LV][VPlan] Proposal for Outer Loop Vectorization Implementation Plan

Tue Jan 16 04:04:20 PST 2018

On 16 January 2018 at 01:08, Hal Finkel <hfinkel at anl.gov> wrote:
> I certainly understand what you're saying, but, as you point out, many
> of these are existing bugs that are being exposed by other changes (and
> they're seemingly all over the map). My general feeling is that the more
> limited the applicability of a particular transform the buggier it will
> tend to be. The work here to allow vectorization of an ever-wider set
> of inputs will really help to expose, and thus help us eliminate, bugs.

Absolutely agreed.

We haven't stopped working on the vectoriser, but for the past few
years it has felt as if we're always trading performance numbers all
over the place and not progressing.

We need better, more consistent, analysis. We need a generic approach.
We need more powerful approaches. We need a single pipeline that can
decide between clear costs, not luck.

The work Hideki/Ayal/Gil are doing covers most of those topics. It implements:

1. VPlan: which will help us model costs more accurately and pick
the best choice, not the first profitable one
2. Outer loop: which will allow us to look at the loop nest as a
complete set, not a bunch of inner loops
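
As an illustration of point 2, here is a minimal sketch of a loop nest
where looking only at the inner loop is not enough. The function name
`eval_poly` and the example itself are hypothetical, not taken from the
VPlan patches. The inner loop is a recurrence (each step needs the
previous `acc`), while the outer iterations are independent of each
other, so the outer loop is the natural candidate for vectorisation.

```cpp
#include <cstddef>
#include <vector>

// Evaluate one polynomial at every point in xs using Horner's rule.
// coef holds the coefficients, highest degree first.
// Hypothetical example for illustration only.
std::vector<double> eval_poly(const std::vector<double>& coef,
                              const std::vector<double>& xs) {
    std::vector<double> out(xs.size());
    // Outer loop: iterations do not depend on each other.
    for (std::size_t i = 0; i < xs.size(); ++i) {
        double acc = 0.0;
        // Inner loop: a serial recurrence on acc, so its iterations
        // cannot simply be run side by side.
        for (std::size_t j = 0; j < coef.size(); ++j)
            acc = acc * xs[i] + coef[j];
        out[i] = acc;
    }
    return out;
}
```

Whether a given compiler actually vectorises the outer loop here depends
on its cost model and on the target.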

The work Tobi & the Polly guys are doing covers:

3. Fantastic analysis and powerful transformations
4. Exposing code for other passes to profit from

Linaro is looking at HPC workloads (mainly core loops [1]) and we
found that loop distribution would be very profitable to ease register
allocation in big loops, but that needs whole-loop analysis.
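
For readers unfamiliar with the transformation, a minimal sketch of
loop distribution follows. The functions `fused` and `distributed` are
hypothetical and deliberately tiny; the sketch assumes all vectors have
the same length and that the outputs do not alias the inputs, which is
what makes the split legal. Each distributed loop touches fewer values
per iteration than the fused one.

```cpp
#include <cstddef>
#include <vector>

// One loop body doing two unrelated computations: a, b, c, d and both
// results are all live within the same iteration.
void fused(const std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, const std::vector<double>& d,
           std::vector<double>& sum, std::vector<double>& prod) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum[i]  = a[i] + b[i];
        prod[i] = c[i] * d[i];
    }
}

// The same work after distribution: two loops, each with a smaller
// body. Legal here only because the two statements are independent
// (assumption: sum and prod do not alias any input).
void distributed(const std::vector<double>& a, const std::vector<double>& b,
                 const std::vector<double>& c, const std::vector<double>& d,
                 std::vector<double>& sum, std::vector<double>& prod) {
    for (std::size_t i = 0; i < a.size(); ++i)
        sum[i] = a[i] + b[i];
    for (std::size_t i = 0; i < c.size(); ++i)
        prod[i] = c[i] * d[i];
}
```

In a loop this small the split would not pay off; the benefit is
expected in large bodies, and proving independence of the pieces is the
part that requires analysing the loop as a whole.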

But, as was said before, we're lacking in understanding and
organisation. To be able to profit from all of those advances, we need
to understand the loop better, and for that, powerful analysis needs
to happen.

Polly has some powerful ones, but we haven't plugged that in properly.
The work to plug Polly into LLVM and make it an integral part of the
pipeline is important so that we can use parts of its analysis tools
to benefit other passes.

But we also need more alias analysis, inter-procedural access pattern
analysis etc.

> As such, one of the largest benefits of adding the
> function-vectorization work (https://reviews.llvm.org/D22792), and
> outer-loop vectorization capabilities, will be making it easier to throw
> essentially-arbitrary inputs at the vectorizer (and have it do
> something), and thus, hit it more effectively with automated testing.

Function vectorisation is important, but whole-loop analysis
(including outer-loop, distribution, fusion) can open more doors to
new patterns.

Now, the real problem here is below...

> Maybe we can do a better job, even with the current capabilities, of
> automatically generating data-parallel loops with reductions of various
> kinds? I'm thinking about automated testing because, AFAIK, the
> vectorizer is already run through almost all of the relevant benchmarks
> and test suites, and even if we add a few more, we probably need to
> increase the test coverage by a lot more than that.

Without a wider understanding of what's missing and how to improve,
it's hard to know what path to take.

For a while I was running simple things (like Livermore Loops) and
digging specific details, and at every step I realised that I needed
better analysis, but ended up settling for a simplified version just
to get that one case vectorised.

After I stopped working on this, I continued reviewing performance
patches and what ended up happening is that we're always accepting
changes that give positive geomean, but which could also push some of
the past gains down considerably.
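
A minimal sketch of how that can happen, using made-up numbers (the
helper names `geomean` and `worst` are hypothetical): the geometric
mean of per-benchmark speedups can be above 1 while one benchmark has
regressed badly.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark speedups (old_time / new_time, so
// values above 1 mean faster). Assumes a non-empty vector of positive
// values.
double geomean(const std::vector<double>& speedups) {
    double log_sum = 0.0;
    for (double s : speedups)
        log_sum += std::log(s);
    return std::exp(log_sum / static_cast<double>(speedups.size()));
}

// Worst single-benchmark speedup; below 1 means at least one benchmark
// got slower. Assumes a non-empty vector.
double worst(const std::vector<double>& speedups) {
    return *std::min_element(speedups.begin(), speedups.end());
}
```

With the invented speedups {1.30, 1.25, 1.20, 0.70}, `geomean` returns
about 1.08 while `worst` returns 0.70: a patch that looks like an 8%
win overall makes one benchmark 30% slower.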

So much so that my current benchmarks show LLVM 5 performing worse in
almost all cases compared to LLVM 4. This is worrying.

None of those benchmarks were done in a standard way, so I'm not sure
how to replicate them, which means they're worthless and, in summary, I
have wasted a lot of time.

That is why our current focus is to make sure we're all benchmarking
the things that make sense in a way that makes sense.

Our HCQC tool [1] is one way of doing that, but we need more and better
analysis of benchmark results, as well as of which benchmarks we care
about and how we run them.

Every arch / sub-arch / vendor / board tuple has special rules, and we
need to compare the best on each, not the standard on all. But once we
get the numbers we need to compare apples to apples, and that's
sometimes not possible.

> Do you have ideas about how we can have better testing in this area
> otherwise?

I'd like to gather information about which benchmarks we really care
about, how we run them and what analysis we should all do on them. I
can't share my raw numbers with you, but if we agree on a method, we
can share gains and know that they're as close as possible to being
meaningful.

For this thread, we can focus on loop benchmarks. Our team is focusing
on HPC workloads, so we will worry more about heavy loops than
"Hacker's delight" transformations, but we have to make sure we don't
break other people's stuff, so we need *all* in a bundle.

I think the test-suite in benchmark mode has a lot of potential to
become the package that we run for validation (before commit), but
that needs a lot of love before we can trust its results. We need more
relevant benchmarks, and better-suited results and analysis, so that it
can work with the fantastic visualisation LNT currently gives us.

This, IMO, together with whole-loop analysis, should be our priority
for LLVM 7 (because 6 is gone... :)

cheers,
--renato

[1] https://github.com/Linaro/hcqc