[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Tue Nov 8 03:12:24 PST 2011

On 11/08/2011 11:45 AM, Hal Finkel wrote:
> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- And almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will probably
> work for you.

Hey Hal,

those are great news. Especially as the numbers seem to show that 
vectorization has a significant performance impact. What did you compare 
exactly. 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>
Interesting. Do you understand what causes these slowdowns? Can your 
heuristic be improved?

> Largest three compile-time slowdowns:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 1276% slowdown
> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Yes, that is a lot. Do you understand if this time is invested well 
(does it give significant speedups)?

If I understood correctly it seems your vectorizer has quadratic 
complexity which may cause large slowdowns. Do you think it may be 
useful/possible to make it linear by introducing a constant upper bound 
somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and 
most of the vectorization opportunities are close by (in some sense), 
such that we get most of the speedup by locking at a subset of the problem.

> Not everything slows down, MultiSource/Benchmarks/Prolangs-C
> ++/city/city, for example, compiles 10% faster with vectorization
> enabled; but, for the most part, things certainly take longer to compile
> with vectorization enabled. The average slowdown over all tests was 29%,
> the median was 11%. On the other hand, the average speedup over all
> tests was 5.2%, the median was 1.3%.
Nice. I think this is a great start.

> Compared to previous patches, which had a minimum required chain length
> of 3 or 4, I've now made the default 6. While using a chain length of 4
> worked well for targeted benchmarks, it caused an overall slowdown on
> almost all test-suite programs. Using a minimum length of 6 causes, on
> average, a speedup; so I think that is a better default choice.

I also try to understand if it is possible to use your vectorizer for 
Polly. My idea is to do some clever loop unrolling.

Starting from this loop.

for (int i = 0; i < 4; i++)
    A[i] += 1;
    A[i] = B[i] + 3;
    C[i] = A[i];

The classical unroller would create this code:

    A[0] += 1;
    A[0] = B[i] + 3;
    C[0] = A[i];

    A[1] += 1;
    A[1] = B[i] + 3;
    C[1] = A[i];

    A[2] += 1;
    A[2] = B[i] + 3;
    C[2] = A[i];

    A[3] += 1;
    A[3] = B[i] + 3;
    C[3] = A[i];

However, in case I can prove this loop is parallel, I want to create 
this code:

    A[0] += 1;
    A[1] += 1;
    A[2] += 1;
    A[3] += 1;

    A[0] = B[i] + 3;
    A[1] = B[i] + 3;
    A[2] = B[i] + 3;
    A[3] = B[i] + 3;

    C[0] = A[i];
    C[1] = A[i];
    C[2] = A[i];
    C[3] = A[i];

I assume this will allow the vectorization of test cases, that failed 
because of possible aliasing. However, I am more interested, if the
execution order change could also improve the vectorization outcome or 
reduce compile time overhead of your vectorizer.

Thanks for working on the vectorization
Cheers

Tobi