[LLVMdev] Vectorization: Next Steps
Hal Finkel
hfinkel at anl.gov
Mon Feb 13 08:38:27 PST 2012
On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
> I will test your suggestion, but I designed the test case to load the
> memory directly into <4 x float> registers. So there are absolutely no
> permutations or other swizzle or move operations. Maybe the heuristic
> should count not only the depth but also the surrounding load/store
> operations.
I've attached two variants of your file, both of which vectorize as you'd
expect. The core difference between these and your original file is that
I added the 'restrict' keyword so that the compiler can assume that the
arrays don't alias (or, in the first case, I made them globals). You
probably also need to specify some alignment information; otherwise, the
memory operations will be scalarized in codegen.
-Hal
>
> Are the load/store operations vectorized, too? (I designed the test
> case to completely fit the SSE registers)
>
> 2012/2/10 Hal Finkel <hfinkel at anl.gov>
> Carl-Philip,
>
> The reason that this does not vectorize is that it cannot vectorize the
> stores; this leaves only the mul-add chains (and some chains with
> loads), and these have a depth of only 2 (the threshold is 6).
>
> If you pass -mllvm -bb-vectorize-req-chain-depth=2 to clang, then it
> will vectorize. The reason the heuristic has such a large default value
> is to prevent cases where it costs more to permute all of the necessary
> values into and out of the vector registers than is saved by
> vectorizing. Does the code generated with
> -bb-vectorize-req-chain-depth=2 run faster than the unvectorized code?
>
> The heuristic can certainly be improved, and these kinds of test cases
> are very important to that improvement process.
>
> -Hal
>
> On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
> > I have a super-simple test case (4x4 matrix * 4-vector) which gets
> > correctly unrolled, but is not vectorized by -bb-vectorize. (I used
> > LLVM 3.1svn.) I attached the test case so you can see what is going
> > wrong there.
> >
> > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
> > As some of you may know, I committed my basic-block autovectorization
> > pass a few days ago. I encourage anyone interested to try it out (pass
> > -vectorize to opt or -mllvm -vectorize to clang) and provide feedback.
> > Especially in combination with -unroll-allow-partial, I have observed
> > some significant benchmark speedups, but I have also observed some
> > significant slowdowns. I would like to share my thoughts, and hopefully
> > get feedback, on next steps.
> >
> > 1. "Target Data" for vectorization - I think that in order to improve
> > the vectorization quality, the vectorizer will need more information
> > about the target. This information could be provided in the form of a
> > kind of extended target data. This extended target data might contain:
> > - What basic types can be vectorized, and how many of them will fit
> >   into (the largest) vector registers
> > - What classes of operations can be vectorized (division,
> >   conversions / sign extension, etc. are not always supported)
> > - What alignment is necessary for loads and stores
> > - Is scalar-to-vector free?
> >
> > 2. Feedback between passes - We may need to implement a closer coupling
> > between optimization passes than currently exists. Specifically, I have
> > two things in mind:
> > - The vectorizer should communicate more closely with the loop
> >   unroller. First, the loop unroller should try to unroll so as to
> >   preserve maximal load/store alignments. Second, I think it would make
> >   a lot of sense to be able to unroll and then keep the unrolled version
> >   in preference to the original only if this helps vectorization. With
> >   basic-block vectorization, it is often necessary to (partially) unroll
> >   in order to vectorize. Even when we also have real loop vectorization,
> >   however, I still think that it will be important for the loop unroller
> >   to communicate with the vectorizer.
> > - After vectorization, it would make sense for the vectorization pass
> >   to request further simplification, but only on those parts of the
> >   code that it modified.
> >
> > 3. Loop vectorization - It would be nice to have, in addition to
> > basic-block vectorization, a more traditional loop vectorization pass.
> > I think that we'll need a better loop analysis pass in order for this
> > to happen. Some of this was started in LoopDependenceAnalysis, but that
> > pass is not yet finished. We'll need something like this to recognize
> > affine memory references, etc.
> >
> > I look forward to hearing everyone's thoughts.
> >
> > -Hal
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix2.c
Type: text/x-csrc
Size: 424 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix3.c
Type: text/x-csrc
Size: 480 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment-0001.c>