[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Hal Finkel hfinkel at anl.gov
Thu Nov 10 13:25:28 PST 2011


On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> >>> I've attached the latest version of my autovectorization patch.
> >>>
> >>> Working through the test suite has proved to be a productive
> >>> experience ;) -- And almost all of the bugs that it revealed have now
> >>> been fixed. There are still two programs that don't compile with
> >>> vectorization turned on, and I'm working on those now, but in case
> >>> anyone feels like playing with vectorization, this patch will probably
> >>> work for you.
> >>
> >> Hey Hal,
> >>
> >> this is great news, especially as the numbers seem to show that
> >> vectorization has a significant performance impact. What exactly did
> >> you compare? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
> >
> > Yes. [I've tested the current patch directly using opt -vectorize
> > -unroll-allow-partial; for running the test suite I recompiled
> > llvm/clang to hardcode the options as I wanted them].
> 
> You should not need to hack clang. As shown above, you should be able to
> pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your 
> CFLAGS. 

-mllvm -unroll-allow-partial works because -unroll-allow-partial is an
option to the pass itself, but I had added -vectorize as an option to
the opt tool, and opt's own options are not picked up by clang. Would
it be better to add -vectorize as an option in IPO/PassManagerBuilder?
Maybe I'll try doing it that way instead.
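Something like the following is what I have in mind (just a sketch, not
a final patch; it assumes the pass is exposed through a
createBBVectorizePass() factory function):

   // In lib/Transforms/IPO/PassManagerBuilder.cpp:
   static cl::opt<bool>
   RunVectorization("vectorize", cl::init(false), cl::Hidden,
                    cl::desc("Run basic-block vectorization"));

   // ... and in populateModulePassManager(), after the loop unroller
   // has been scheduled:
   if (OptLevel > 1 && RunVectorization)
     MPM.add(createBBVectorizePass());

That way 'clang -O3 -mllvm -vectorize' would pick up the flag just like
-unroll-allow-partial.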

 -Hal

> Did you run the test suite at -O3?
> 
> Also I think it would be interesting to compare for the test-suite the 
> performance of
> 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. This will
> show how much of the runtime overhead is due to the unrolling (it
> produces more code that needs to be optimized) and which part is due
> to vectorization. The same goes for the speedup: how much is caused
> by unrolling, and how much is actually caused by your pass?
> 
> >>> The largest three performance speedups are:
> >>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> >>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> >>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
> >>>
> >>> The largest three performance slowdowns are:
> >>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> >>> 114% slowdown
> >>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> >>> slowdown
> >>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
> >>>
> >> Interesting. Do you understand what causes these slowdowns? Can your
> >> heuristic be improved?
> >
> > I've not specifically looked at these cases.
> >
> > Generally, I've observed slowdowns from the introduction of too many
> > permutations per chain (the chain selection procedure can be changed to
> > help this, and I'll work on that). Also, sometimes legalization of
> > non-native vector operations creates inefficient code, and I'll also
> > look at these cases in more detail.
> Sure. That sounds reasonable. I am confident you can improve this gradually.
> 
> >> If I understood correctly it seems your vectorizer has quadratic
> >> complexity which may cause large slowdowns. Do you think it may be
> >> useful/possible to make it linear by introducing a constant upper bound
> >> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
> >> most of the vectorization opportunities are close by (in some sense),
> >> such that we get most of the speedup by looking at a subset of the problem.
> >
> > Yes, I agree. That makes a lot of sense.
> >
> > What would be even better is if the loop unroller would intermix
> > statements from the loops where possible instead of leaving it to the
> > vectorizer to do all of the grouping after the fact. That, I fear, is a
> > whole other project.
> 
> First, I do not understand the 'even better' here. To me it would be
> great if, on the one hand, the vectorizer could be constrained such
> that the compile time is predictable, and, on the other hand, we
> could make the loop unroller create code in a way such that the
> constrained vectorizer still performs the relevant transformations.
> 
> Also, what do you mean with 'if the loop unroller would intermix
> statements from the loops where possible'. Are you referring to the
> grouped unrolling as shown in my last mail?
> 
> 
> >>> Compared to previous patches, which had a minimum required chain length
> >>> of 3 or 4, I've now made the default 6. While using a chain length of 4
> >>> worked well for targeted benchmarks, it caused an overall slowdown on
> >>> almost all test-suite programs. Using a minimum length of 6 causes, on
> >>> average, a speedup; so I think that is a better default choice.
> >>
> >> I also try to understand if it is possible to use your vectorizer for
> >> Polly. My idea is to do some clever loop unrolling.
> >>
> >> Starting from this loop.
> >>
> >> for (int i = 0; i < 4; i++) {
> >>      A[i] += 1;
> >>      A[i] = B[i] + 3;
> >>      C[i] = A[i];
> >> }
> >>
> >> The classical unroller would create this code:
> >>
> >>      A[0] += 1;
> >>      A[0] = B[0] + 3;
> >>      C[0] = A[0];
> >>
> >>      A[1] += 1;
> >>      A[1] = B[1] + 3;
> >>      C[1] = A[1];
> >>
> >>      A[2] += 1;
> >>      A[2] = B[2] + 3;
> >>      C[2] = A[2];
> >>
> >>      A[3] += 1;
> >>      A[3] = B[3] + 3;
> >>      C[3] = A[3];
> >>
> >> However, in case I can prove this loop is parallel, I want to create
> >> this code:
> >>
> >>      A[0] += 1;
> >>      A[1] += 1;
> >>      A[2] += 1;
> >>      A[3] += 1;
> >>
> >>      A[0] = B[0] + 3;
> >>      A[1] = B[1] + 3;
> >>      A[2] = B[2] + 3;
> >>      A[3] = B[3] + 3;
> >>
> >>      C[0] = A[0];
> >>      C[1] = A[1];
> >>      C[2] = A[2];
> >>      C[3] = A[3];
> >>
> >> I assume this will allow the vectorization of test cases that
> >> previously failed because of possible aliasing. However, I am more
> >> interested in whether the change in execution order could also
> >> improve the vectorization outcome or reduce the compile-time
> >> overhead of your vectorizer.
> >
> > Yes, this would certainly help.
> 
> In which way? How much? Would it be possible to restrict the vectorizer
> such that its complexity is linear, while at the same time ensuring
> that it will still vectorize code that is 'group unrolled' (that is,
> unrolled in the way shown above)?
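It would help in exactly this way: grouped unrolling places the
pairable operations next to one another, so a search that only
considers candidate pairs within a fixed window of instructions would
still find them, and the pairing search becomes O(n * window) instead
of O(n^2). Roughly (a sketch only; the helper names here are
illustrative, not from the current patch):

   // Consider only candidate pairs within a fixed distance of each
   // other; for group-unrolled code the matching operations are
   // adjacent, so a small window should be enough.
   const unsigned SearchWindow = 20; // tunable cutoff
   for (BasicBlock::iterator I = BB.begin(), E = BB.end(); I != E; ++I) {
     BasicBlock::iterator J = I; ++J;
     for (unsigned K = 0; J != E && K < SearchWindow; ++J, ++K)
       if (isCandidatePair(I, J))   // hypothetical legality/profit check
         recordCandidatePair(I, J); // hypothetical
   }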
> 
> > By the way, by default the current implementation will create
> > unaligned vector loads and stores; these are generally broken up by
> > the legalizer. This behavior can be suppressed using the
> > -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> > chose the unrolling factor to preserve the maximum available alignment,
> > but I don't think that it currently does so.
> I don't know, but it sounds like a good thing to do.
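To make the alignment point concrete: if A is a 16-byte-aligned array
of 4-byte floats, then an unroll factor that is a multiple of 4 lets
the vectorizer form <4 x float> accesses that all start on 16-byte
boundaries, whereas an unroll factor of 2 limits it to <2 x float>
accesses within each unrolled body.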
> 
> > One problem with the current implementation is that it relies on
> > GetPointerBaseWithConstantOffset to determine if two loads or stores
> > share the same base address. This fails with partially-unrolled loops
> > because it cannot "undo" all of the additions to the offset induction
> > variable in order to understand that some of the loads and stores are
> > really adjacent in memory. This is something that I think can be
> > improved within the vectorizer itself, and I'm planning on doing some
> > work on this in the future.
> Here you may also want to look into ScalarEvolution. Basically, two
> loads access adjacent memory if the difference between the scalar
> evolutions of the two load addresses is equal to
> sizeof(element_type). ScalarEvolution should be a lot more general
> than GetPointerBaseWithConstantOffset().
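Using ScalarEvolution as you suggest, the adjacency check could look
roughly like this (just a sketch against the current API, untested; TD
is the TargetData analysis):

   #include "llvm/Analysis/ScalarEvolution.h"
   #include "llvm/Analysis/ScalarEvolutionExpressions.h"

   // Return true if L2 loads the element immediately after L1.
   static bool areConsecutive(LoadInst *L1, LoadInst *L2,
                              ScalarEvolution &SE, TargetData &TD) {
     const SCEV *PtrA = SE.getSCEV(L1->getPointerOperand());
     const SCEV *PtrB = SE.getSCEV(L2->getPointerOperand());
     // Adjacent iff the address difference is exactly one element.
     const SCEV *Diff = SE.getMinusSCEV(PtrB, PtrA);
     if (const SCEVConstant *C = dyn_cast<SCEVConstant>(Diff))
       return C->getValue()->equalsInt(TD.getTypeStoreSize(L1->getType()));
     return false; // non-constant or unknown distance
   }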
> 
> Cheers
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory



