[LLVMdev] Vectorization: Next Steps

Mon Feb 13 08:38:27 PST 2012

On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
> I will test your suggestion, but I designed the test case to load the
> memory directly into <4 x float> registers. So there is absolutely no
> permutation and other swizzle or move operations. Maybe the heuristic
> should not only count the depth but also the surrounding load/store
> operations.

I've attached two variants of your file, both of which vectorize as you'd
expect. The core difference between these and your original file is that
I added the 'restrict' keyword so that the compiler can assume that the
arrays don't alias (or, in the first case, I made them globals). You
will probably also need to specify some alignment information; otherwise
the memory operations will be scalarized in codegen.
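
For anyone reading this without the attachments, the two changes might
look roughly like the sketch below. This is not the contents of matrix2.c
or matrix3.c. The function names, the row-major layout, and the use of the
GCC/Clang extensions __attribute__((aligned)) and __builtin_assume_aligned
to convey 16-byte alignment are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1 (assumed shape): globals. Distinct global arrays cannot
 * alias each other, and the aligned attribute (a GCC/Clang extension)
 * requests 16-byte alignment for each of them. */
float g_m[16] __attribute__((aligned(16)));  /* row-major 4x4 matrix */
float g_v[4] __attribute__((aligned(16)));   /* input vector */
float g_out[4] __attribute__((aligned(16))); /* result vector */

void mat4_mul_vec4_globals(void)
{
    for (int i = 0; i < 4; ++i)
        g_out[i] = g_m[4 * i + 0] * g_v[0] + g_m[4 * i + 1] * g_v[1]
                 + g_m[4 * i + 2] * g_v[2] + g_m[4 * i + 3] * g_v[3];
}

/* Variant 2 (assumed shape): pointer arguments marked 'restrict', so the
 * compiler may assume that out, m and v do not overlap. */
void mat4_mul_vec4(float *restrict out, const float *restrict m,
                   const float *restrict v)
{
    /* Callers must pass 16-byte aligned pointers; check in debug builds. */
    assert((uintptr_t)out % 16 == 0);
    assert((uintptr_t)m % 16 == 0);
    assert((uintptr_t)v % 16 == 0);

    /* Pass the alignment assumption on to the optimizer. */
    float *o = __builtin_assume_aligned(out, 16);
    const float *a = __builtin_assume_aligned(m, 16);
    const float *x = __builtin_assume_aligned(v, 16);

    for (int i = 0; i < 4; ++i)
        o[i] = a[4 * i + 0] * x[0] + a[4 * i + 1] * x[1]
             + a[4 * i + 2] * x[2] + a[4 * i + 3] * x[3];
}
```

Whether either form actually vectorizes still depends on the compiler
version and the flags in use, so it is worth checking the generated IR or
assembly rather than taking that for granted.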

 -Hal

> 
> Are the load/store operations vectorized, too? (I designed the test
> case to completely fit the SSE registers)
> 
> 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>         Carl-Philip,
>         
>         The reason that this does not vectorize is that it cannot
>         vectorize the stores; this leaves only the mul-add chains (and
>         some chains with loads), and they only have a depth of 2 (the
>         threshold is 6).
>         
>         If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
>         will vectorize. The reason the heuristic has such a large default
>         value is to prevent cases where it costs more to permute all of
>         the necessary values into and out of the vector registers than is
>         saved by vectorizing. Does the code generated with
>         -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
>         code?
>         
>         The heuristic can certainly be improved, and these kinds of test
>         cases are very important to that improvement process.
>         
>          -Hal
>         
>         On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
>         > I have a super-simple test case 4x4 matrix * 4-vector which gets
>         > correctly unrolled, but is not vectorized by -bb-vectorize. (I
>         > used llvm 3.1svn)
>         > I attached the test case so you can see what is going wrong
>         > there.
>         >
>         > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
>         >         As some of you may know, I committed my basic-block
>         >         autovectorization pass a few days ago. I encourage
>         >         anyone interested to try it out (pass -vectorize to opt
>         >         or -mllvm -vectorize to clang) and provide feedback.
>         >         Especially in combination with -unroll-allow-partial, I
>         >         have observed some significant benchmark speedups, but,
>         >         I have also observed some significant slowdowns. I would
>         >         like to share my thoughts, and hopefully get feedback,
>         >         on next steps.
>         >
>         >         1. "Target Data" for vectorization - I think that in
>         >         order to improve the vectorization quality, the
>         >         vectorizer will need more information about the target.
>         >         This information could be provided in the form of a kind
>         >         of extended target data. This extended target data might
>         >         contain:
>         >          - What basic types can be vectorized, and how many of
>         >            them will fit into (the largest) vector registers
>         >          - What classes of operations can be vectorized
>         >            (division, conversions / sign extension, etc. are not
>         >            always supported)
>         >          - What alignment is necessary for loads and stores
>         >          - Is scalar-to-vector free?
>         >
>         >         2. Feedback between passes - We may need to implement a
>         >         closer coupling between optimization passes than
>         >         currently exists. Specifically, I have in mind two
>         >         things:
>         >          - The vectorizer should communicate more closely with
>         >            the loop unroller. First, the loop unroller should try
>         >            to unroll to preserve maximal load/store alignments.
>         >            Second, I think it would make a lot of sense to be
>         >            able to unroll and then keep the unrolled version in
>         >            preference to the original only if this helps
>         >            vectorization. With basic block vectorization, it is
>         >            often necessary to (partially) unroll in order to
>         >            vectorize. Even when we also have real loop
>         >            vectorization, however, I still think that it will be
>         >            important for the loop unroller to communicate with
>         >            the vectorizer.
>         >          - After vectorization, it would make sense for the
>         >            vectorization pass to request further simplification,
>         >            but only on those parts of the code that it modified.
>         >
>         >         3. Loop vectorization - It would be nice to have, in
>         >         addition to basic-block vectorization, a more-traditional
>         >         loop vectorization pass. I think that we'll need a better
>         >         loop analysis pass in order for this to happen. Some of
>         >         this was started in LoopDependenceAnalysis, but that pass
>         >         is not yet finished. We'll need something like this to
>         >         recognize affine memory references, etc.
>         >
>         >         I look forward to hearing everyone's thoughts.
>         >
>         >          -Hal
>         >
>         >         --
>         >         Hal Finkel
>         >         Postdoctoral Appointee
>         >         Leadership Computing Facility
>         >         Argonne National Laboratory
>         >
>         >         _______________________________________________
>         >         LLVM Developers mailing list
>         >         LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>         >         http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>         >
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix2.c
Type: text/x-csrc
Size: 424 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix3.c
Type: text/x-csrc
Size: 480 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment-0001.c>