[llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hal Finkel
hfinkel at anl.gov
Tue Nov 8 14:29:50 PST 2011
On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> >>> I've attached the latest version of my autovectorization patch.
> >>>
> >>> Working through the test suite has proved to be a productive
> >>> experience ;) -- and almost all of the bugs that it revealed have now
> >>> been fixed. There are still two programs that don't compile with
> >>> vectorization turned on, and I'm working on those now, but in case
> >>> anyone feels like playing with vectorization, this patch will probably
> >>> work for you.
> >>
> >> Hey Hal,
> >>
> >> that is great news, especially as the numbers seem to show that
> >> vectorization has a significant performance impact. What exactly did you
> >> compare? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
> >
> > Yes. [I've tested the current patch directly using opt -vectorize
> > -unroll-allow-partial; for running the test suite I recompiled
> > llvm/clang to hardcode the options as I wanted them].
>
> You should not need to hack clang. As shown above, you should be able to
> pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your
> CFLAGS. Did you run the test suite at -O3?
That's good to know. Yes, it was run at -O3.
>
> Also, I think it would be interesting to compare, for the test-suite,
> the performance of
> 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. It will show
> how much of the runtime overhead is due to the unrolling (which produces
> more code that needs to be optimized) and how much is due to
> vectorization. The same goes for the speedup: how much is caused by
> unrolling and how much is actually caused by your pass.
Just turning on partial unrolling yields, on average, a 4.7% slowdown
(0% median), and a 28.6% increase in compile time (21.8% median).
>
> >>> The largest three performance speedups are:
> >>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
With just partial unrolling: 64.8% speedup
> >>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
With just partial unrolling: 58.2% speedup
> >>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
With just partial unrolling: 11.8% speedup
> >>>
> >>> The largest three performance slowdowns are:
> >>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> >>> 114% slowdown
With just partial unrolling: 0% slowdown (47.8% slowdown in compile
time)
> >>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> >>> slowdown
With just partial unrolling: 8.3% speedup
> >>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
With just partial unrolling: 8.3% speedup
> >>>
Here are the top-3 lists comparing vectorization plus partial unrolling
against partial unrolling alone:
Top speedups:
SingleSource/Benchmarks/Misc/flops-7 - 44.1% speedup
SingleSource/Benchmarks/Misc/flops-1 - 42.3% speedup
MultiSource/Applications/sqlite3/sqlite3 - 42.3% speedup
Top slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 81.8%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 79% slowdown
Top compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
832% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 600% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 263% slowdown
For this comparison, however (unlike the comparison to plain -O3), there
are a significant number of compile-time speedups (I suspect this is
because vectorization can reduce the number of instructions processed by
later passes). Top compile-time speedups:
SingleSource/Benchmarks/Stanford/Oscar - 46% speedup
SingleSource/Regression/C/bigstack - 45% speedup
SingleSource/Benchmarks/BenchmarkGame/fannkuch - 45% speedup
> >> Interesting. Do you understand what causes these slowdowns? Can your
> >> heuristic be improved?
> >
> > I've not specifically looked at these cases.
> >
> > Generally, I've observed slowdowns from the introduction of too many
> > permutations per chain (the chain selection procedure can be changed to
> > help this, and I'll work on that). Also, sometimes legalization of
> > non-native vector operations creates inefficient code, and I'll also
> > look at these cases in more detail.
> Sure. That sounds reasonable. I am confident you can improve this gradually.
>
> >> If I understood correctly, it seems your vectorizer has quadratic
> >> complexity, which may cause large slowdowns. Do you think it may be
> >> useful/possible to make it linear by introducing a constant upper bound
> >> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
> >> most of the vectorization opportunities are close by (in some sense),
> >> such that we get most of the speedup by looking at a subset of the problem.
> >
> > Yes, I agree. That makes a lot of sense.
> >
> > What would be even better is if the loop unroller would intermix
> > statements from the loops where possible instead of leaving it to the
> > vectorizer to do all of the grouping after the fact. That, I fear, is a
> > whole other project.
>
> First, I do not understand the 'even better' here. To me it would be
> great if, on the one hand, the vectorizer could be constrained such that
> the compile time is predictable, and, on the other hand, we could make
> the loop unroller create code in a way such that the constrained
> vectorizer still performs the relevant transformations.
I was not clear; I meant that imposing a cutoff is likely to work, but
it would work even better if the loop unroller produced code more
conducive to vectorization.
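To make the cutoff idea concrete, here is a rough sketch (not code from
the patch; isPairableWith and recordCandidatePair are hypothetical
placeholders) of how the pairing search could be bounded:

  // Examine at most SearchLimit candidate partners per instruction, so
  // the scan is linear in the block size rather than quadratic.
  static const unsigned SearchLimit = 20; // e.g. 10/20/100, as suggested

  for (BasicBlock::iterator I = BB.begin(), E = BB.end(); I != E; ++I) {
    unsigned Steps = 0;
    for (BasicBlock::iterator J = llvm::next(I);
         J != E && Steps < SearchLimit; ++J, ++Steps) {
      if (isPairableWith(I, J))    // hypothetical legality/profit check
        recordCandidatePair(I, J); // hypothetical candidate bookkeeping
    }
  }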
>
> Also, what do you mean by 'if the loop unroller would intermix
> statements from the loops where possible'? Are you referring to the
> grouped unrolling as shown in my last mail?
Yes.
Also, the code necessary to take advantage of grouped unrolling already
exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
causes the vectorizer to stop searching for instruction pairings after
the first use of an instruction.
>
>
> >>> Compared to previous patches, which had a minimum required chain length
> >>> of 3 or 4, I've now made the default 6. While using a chain length of 4
> >>> worked well for targeted benchmarks, it caused an overall slowdown on
> >>> almost all test-suite programs. Using a minimum length of 6 causes, on
> >>> average, a speedup; so I think that is a better default choice.
> >>
> >> I am also trying to understand whether it is possible to use your
> >> vectorizer for Polly. My idea is to do some clever loop unrolling.
> >>
> >> Starting from this loop:
> >>
> >> for (int i = 0; i < 4; i++) {
> >>   A[i] += 1;
> >>   A[i] = B[i] + 3;
> >>   C[i] = A[i];
> >> }
> >>
> >> The classical unroller would create this code:
> >>
> >> A[0] += 1;
> >> A[0] = B[0] + 3;
> >> C[0] = A[0];
> >>
> >> A[1] += 1;
> >> A[1] = B[1] + 3;
> >> C[1] = A[1];
> >>
> >> A[2] += 1;
> >> A[2] = B[2] + 3;
> >> C[2] = A[2];
> >>
> >> A[3] += 1;
> >> A[3] = B[3] + 3;
> >> C[3] = A[3];
> >>
> >> However, if I can prove this loop is parallel, I want to create
> >> this code:
> >>
> >> A[0] += 1;
> >> A[1] += 1;
> >> A[2] += 1;
> >> A[3] += 1;
> >>
> >> A[0] = B[0] + 3;
> >> A[1] = B[1] + 3;
> >> A[2] = B[2] + 3;
> >> A[3] = B[3] + 3;
> >>
> >> C[0] = A[0];
> >> C[1] = A[1];
> >> C[2] = A[2];
> >> C[3] = A[3];
> >>
> >> I assume this will allow the vectorization of test cases that failed
> >> because of possible aliasing. However, I am more interested in whether
> >> the execution order change could also improve the vectorization outcome
> >> or reduce the compile time overhead of your vectorizer.
> >
> > Yes, this would certainly help.
>
> In which way? How much? Would it be possible to restrict the vectorizer
> such that its complexity is linear, while at the same time we can
> ensure it will still vectorize code that is 'group unrolled' (which
> means unrolled in the way shown above)?
It would help because we could stop the pairing search after an
instruction's first use; that is significantly faster and generates
far fewer potential pairings for the later stages to consider. Exactly
"how much" it would help I can't yet answer; I'd need to construct
actual benchmarks for that. If you have some group-unrolled code, then
it will be simple to test: just pass -bb-vectorize-fast-dep.
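For example, assuming the patch is applied, something like

  opt -vectorize -bb-vectorize-fast-dep -S grouped.ll -o grouped.vec.ll

(with grouped.ll standing in for your group-unrolled module) should show
whether the restricted search still vectorizes it.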
>
> > By the way, the current implementation, by default, will create
> > unaligned vector loads and stores, but these are generally broken up by
> > the legalizer. This behavior can be suppressed using the
> > -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> > chose the unrolling factor to preserve the maximum available alignment,
> > but I don't think that it currently does so.
> I don't know, but it sounds like a good thing to do.
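To illustrate with a made-up example: if A is a 16-byte-aligned array of
floats, then an unroll factor that is a multiple of 4 keeps the first
access of each unrolled group on a 16-byte boundary, so the group can
become an aligned <4 x float> load:

  float A[1024] __attribute__((aligned(16)));
  for (int i = 0; i < 1024; i += 4) {
    A[i]   += 1.0f; /* each group starts 16-byte aligned */
    A[i+1] += 1.0f;
    A[i+2] += 1.0f;
    A[i+3] += 1.0f;
  }

An unroll factor of 3, by contrast, would leave most groups straddling
16-byte boundaries.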
>
> > One problem with the current implementation is that it relies on
> > GetPointerBaseWithConstantOffset to determine if two loads or stores
> > share the same base address. This fails with partially-unrolled loops
> > because it cannot "undo" all of the additions to the offset induction
> > variable in order to understand that some of the loads and stores are
> > really adjacent in memory. This is something that I think can be
> > improved within the vectorizer itself, and I'm planning on doing some
> > work on this in the future.
> Here you may also want to look into ScalarEvolution. Basically, two loads
> access adjacent memory if the difference of the scalar evolutions of the
> two load addresses is equal to sizeof(element_type). ScalarEvolution
> should be a lot more general than GetPointerBaseWithConstantOffset().
Thanks! That sounds great; I'll have to look at that.
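For reference, I imagine something along these lines (an untested sketch
against the ScalarEvolution interface, not code from the patch; the
helper name is mine):

  #include "llvm/Instructions.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/Target/TargetData.h"
  using namespace llvm;

  // LoadB reads the element immediately after LoadA if
  // SCEV(PtrA) + sizeof(element) == SCEV(PtrB); SCEV expressions are
  // uniqued, so comparing the two pointers is the usual equality test.
  static bool loadsAreConsecutive(LoadInst *LoadA, LoadInst *LoadB,
                                  ScalarEvolution &SE, TargetData &TD) {
    const SCEV *PtrA = SE.getSCEV(LoadA->getPointerOperand());
    const SCEV *PtrB = SE.getSCEV(LoadB->getPointerOperand());
    uint64_t EltSize = TD.getTypeStoreSize(LoadA->getType());
    LLVMContext &Ctx = LoadA->getType()->getContext();
    const SCEV *Size = SE.getConstant(TD.getIntPtrType(Ctx), EltSize);
    return SE.getAddExpr(PtrA, Size) == PtrB;
  }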
-Hal
>
> Cheers
> Tobi
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory