[LLVMdev] Vectorization: Next Steps
Carl-Philip Hänsch
cphaensch at googlemail.com
Mon Feb 13 02:11:38 PST 2012
I will test your suggestion, but I designed the test case to load the
memory directly into <4 x float> registers, so there are no permutation,
swizzle, or move operations at all. Maybe the heuristic should not only
count the chain depth but also consider the surrounding load/store
operations.
Are the load/store operations vectorized, too? (I designed the test case
to fit completely into the SSE registers.)
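
To make this concrete, the kernel is essentially the following (an
illustrative sketch only; the exact code is in the attachment to my
previous mail):

    // Illustrative sketch; the attached test case may differ in detail.
    // 4x4 matrix times 4-vector, sized so that each row of m and all of v
    // fit exactly into a 128-bit SSE register as <4 x float>.
    void mat4_mul_vec4(const float m[4][4], const float v[4], float out[4]) {
      for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < 4; ++j)
          sum += m[i][j] * v[j];
        out[i] = sum;
      }
    }

With -O3 both loops are fully unrolled; I will retry with
-mllvm -bb-vectorize-req-chain-depth=2 as you suggest and report whether
the result is faster than the unvectorized code.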
2012/2/10 Hal Finkel <hfinkel at anl.gov>
> Carl-Philip,
>
> The reason that this does not vectorize is that it cannot vectorize the
> stores; this leaves only the mul-add chains (and some chains with
> loads), and they only have a depth of 2 (the threshold is 6).
>
> If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it will
> vectorize. The reason the heuristic has such a large default value is to
> prevent cases where it costs more to permute all of the necessary values
> into and out of the vector registers than is saved by vectorizing. Does
> the code generated with -bb-vectorize-req-chain-depth=2 run faster than
> the unvectorized code?
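
To make the chain depth concrete: after full unrolling, each output
element of the kernel reduces to roughly the expression below (sketched
here for illustration, not the exact attached code):

    // One output element after full unrolling: four loads feed four
    // multiplies and a short add tree -- the kind of short chain that,
    // per the analysis above, falls below the default depth threshold of 6.
    void mat4_mul_vec4_row0(const float m[4][4], const float v[4], float out[4]) {
      out[0] = m[0][0]*v[0] + m[0][1]*v[1] + m[0][2]*v[2] + m[0][3]*v[3];
    }
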
>
> The heuristic can certainly be improved, and these kinds of test cases
> are very important to that improvement process.
>
> -Hal
>
> On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
> > I have a super-simple test case (4x4 matrix * 4-vector) which gets
> > correctly unrolled, but is not vectorized by -bb-vectorize. (I used
> > LLVM 3.1svn.)
> > I attached the test case so you can see what is going wrong there.
> >
> > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
> > As some of you may know, I committed my basic-block autovectorization
> > pass a few days ago. I encourage anyone interested to try it out (pass
> > -vectorize to opt or -mllvm -vectorize to clang) and provide feedback.
> > Especially in combination with -unroll-allow-partial, I have observed
> > some significant benchmark speedups, but I have also observed some
> > significant slowdowns. I would like to share my thoughts, and hopefully
> > get feedback, on next steps.
> >
> > 1. "Target Data" for vectorization - I think that in order to improve
> > the vectorization quality, the vectorizer will need more information
> > about the target. This information could be provided in the form of a
> > kind of extended target data. This extended target data might contain:
> >  - What basic types can be vectorized, and how many of them will fit
> >    into (the largest) vector registers
> >  - What classes of operations can be vectorized (division, conversions /
> >    sign extension, etc. are not always supported)
> >  - What alignment is necessary for loads and stores
> >  - Is scalar-to-vector free?
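
A hypothetical sketch of the kind of queries such extended target data
could answer (the names below are invented for illustration, not an
existing LLVM interface):

    // Hypothetical sketch only -- no such interface exists in LLVM; it
    // just illustrates the queries an extended target data might answer.
    namespace llvm { class Type; }

    struct VectorTargetInfo {
      // Widest vector register, in bits (e.g. 128 for SSE).
      virtual unsigned getMaxVectorWidthBits() const = 0;
      // Which scalar types can be placed in a vector register at all.
      virtual bool isTypeVectorizable(const llvm::Type *Ty) const = 0;
      // Division, conversions, sign extension, etc. are not always available.
      virtual bool isOpVectorizable(unsigned Opcode, const llvm::Type *Ty) const = 0;
      // Alignment required for vector loads and stores of this type.
      virtual unsigned getLoadStoreAlignment(const llvm::Type *VecTy) const = 0;
      // Whether moving a scalar into a vector lane is effectively free.
      virtual bool isScalarToVectorFree() const = 0;
      virtual ~VectorTargetInfo() {}
    };
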
> >
> > 2. Feedback between passes - We may need to implement a closer coupling
> > between optimization passes than currently exists. Specifically, I have
> > in mind two things:
> >  - The vectorizer should communicate more closely with the loop
> >    unroller. First, the loop unroller should try to unroll to preserve
> >    maximal load/store alignments. Second, I think it would make a lot of
> >    sense to be able to unroll and then keep the unrolled version in
> >    preference to the original only if this helps vectorization. With
> >    basic-block vectorization, it is often necessary to (partially)
> >    unroll in order to vectorize. Even when we also have real loop
> >    vectorization, however, I still think that it will be important for
> >    the loop unroller to communicate with the vectorizer.
> >  - After vectorization, it would make sense for the vectorization pass
> >    to request further simplification, but only on those parts of the
> >    code that it modified.
> >
> > 3. Loop vectorization - It would be nice to have, in addition to
> > basic-block vectorization, a more-traditional loop vectorization pass.
> > I think that we'll need a better loop analysis pass in order for this
> > to happen. Some of this was started in LoopDependenceAnalysis, but that
> > pass is not yet finished. We'll need something like this to recognize
> > affine memory references, etc.
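
As a generic example (not from the attached test case) of the kind of
loop such an analysis has to recognize, assuming x and y do not alias:

    // Both subscripts are affine in the induction variable i (stride 1,
    // offset 0); recognizing this lets a dependence analysis prove the
    // iterations independent, so a loop vectorizer can emit wide loads,
    // a wide multiply-add, and wide stores.
    void saxpy(float *y, const float *x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }
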
> >
> > I look forward to hearing everyone's thoughts.
> >
> > -Hal
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
>