[LLVMdev] Vectorization: Next Steps
Carl-Philip Hänsch
cphaensch at googlemail.com
Mon Feb 13 02:11:38 PST 2012
I will test your suggestion, but I designed the test case to load the
memory directly into <4 x float> registers, so there are no permutation,
swizzle, or move operations at all. Maybe the heuristic should not only
count the chain depth but also consider the surrounding load/store
operations.
Are the load/store operations vectorized, too? (I designed the test case
to fit completely into the SSE registers.)
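
To make this concrete, the kernel is essentially the following (an
illustrative sketch only; the exact code is in the attachment to my
previous mail):

    // Illustrative sketch; the attached test case may differ in detail.
    // 4x4 matrix times 4-vector, sized so that each row of m and all of v
    // fit exactly into a 128-bit SSE register as <4 x float>.
    void mat4_mul_vec4(const float m[4][4], const float v[4], float out[4]) {
      for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < 4; ++j)
          sum += m[i][j] * v[j];
        out[i] = sum;
      }
    }

With -O3 both loops are fully unrolled; I will retry with
-mllvm -bb-vectorize-req-chain-depth=2 as you suggest and report whether
the result is faster than the unvectorized code.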
2012/2/10 Hal Finkel <hfinkel at anl.gov>
> Carl-Philip,
>
> The reason that this does not vectorize is that it cannot vectorize the
> stores; this leaves only the mul-add chains (and some chains with
> loads), and they only have a depth of 2 (the threshold is 6).
>
> If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it will
> vectorize. The reason the heuristic has such a large default value is to
> prevent cases where it costs more to permute all of the necessary values
> into and out of the vector registers than is saved by vectorizing. Does
> the code generated with -bb-vectorize-req-chain-depth=2 run faster than
> the unvectorized code?
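
To make the chain depth concrete: after full unrolling, each output
element of the kernel reduces to roughly the expression below (sketched
here for illustration, not the exact attached code):

    // One output element after full unrolling: four loads feed four
    // multiplies and a short add tree -- the kind of short chain that,
    // per the analysis above, falls below the default depth threshold of 6.
    void mat4_mul_vec4_row0(const float m[4][4], const float v[4], float out[4]) {
      out[0] = m[0][0]*v[0] + m[0][1]*v[1] + m[0][2]*v[2] + m[0][3]*v[3];
    }
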
>
> The heuristic can certainly be improved, and these kinds of test cases
> are very important to that improvement process.
>
> -Hal
>
> On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
> > I have a super-simple test case (4x4 matrix * 4-vector) which gets
> > correctly unrolled, but is not vectorized by -bb-vectorize. (I used
> > LLVM 3.1svn.)
> > I attached the test case so you can see what is going wrong there.
> >
> > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
> > As some of you may know, I committed my basic-block autovectorization
> > pass a few days ago. I encourage anyone interested to try it out (pass
> > -vectorize to opt or -mllvm -vectorize to clang) and provide feedback.
> > Especially in combination with -unroll-allow-partial, I have observed
> > some significant benchmark speedups, but I have also observed some
> > significant slowdowns. I would like to share my thoughts, and hopefully
> > get feedback, on next steps.
> >
> > 1. "Target Data" for vectorization - I think that in order to improve
> > the vectorization quality, the vectorizer will need more information
> > about the target. This information could be provided in the form of a
> > kind of extended target data. This extended target data might contain:
> >  - What basic types can be vectorized, and how many of them will fit
> >    into (the largest) vector registers
> >  - What classes of operations can be vectorized (division, conversions /
> >    sign extension, etc. are not always supported)
> >  - What alignment is necessary for loads and stores
> >  - Is scalar-to-vector free?
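
A hypothetical sketch of the kind of queries such extended target data
could answer (the names below are invented for illustration, not an
existing LLVM interface):

    // Hypothetical sketch only -- no such interface exists in LLVM; it
    // just illustrates the queries an extended target data might answer.
    namespace llvm { class Type; }

    struct VectorTargetInfo {
      // Widest vector register, in bits (e.g. 128 for SSE).
      virtual unsigned getMaxVectorWidthBits() const = 0;
      // Which scalar types can be placed in a vector register at all.
      virtual bool isTypeVectorizable(const llvm::Type *Ty) const = 0;
      // Division, conversions, sign extension, etc. are not always available.
      virtual bool isOpVectorizable(unsigned Opcode, const llvm::Type *Ty) const = 0;
      // Alignment required for vector loads and stores of this type.
      virtual unsigned getLoadStoreAlignment(const llvm::Type *VecTy) const = 0;
      // Whether moving a scalar into a vector lane is effectively free.
      virtual bool isScalarToVectorFree() const = 0;
      virtual ~VectorTargetInfo() {}
    };
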
> >
> > 2. Feedback between passes - We may need to implement a closer coupling
> > between optimization passes than currently exists. Specifically, I have
> > in mind two things:
> >  - The vectorizer should communicate more closely with the loop
> >    unroller. First, the loop unroller should try to unroll to preserve
> >    maximal load/store alignments. Second, I think it would make a lot of
> >    sense to be able to unroll and then keep the unrolled version in
> >    preference to the original only if this helps vectorization. With
> >    basic-block vectorization, it is often necessary to (partially)
> >    unroll in order to vectorize. Even when we also have real loop
> >    vectorization, however, I still think that it will be important for
> >    the loop unroller to communicate with the vectorizer.
> >  - After vectorization, it would make sense for the vectorization pass
> >    to request further simplification, but only on those parts of the
> >    code that it modified.
> >
> > 3. Loop vectorization - It would be nice to have, in addition to
> > basic-block vectorization, a more-traditional loop vectorization pass.
> > I think that we'll need a better loop analysis pass in order for this
> > to happen. Some of this was started in LoopDependenceAnalysis, but that
> > pass is not yet finished. We'll need something like this to recognize
> > affine memory references, etc.
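
As a generic example (not from the attached test case) of the kind of
loop such an analysis has to recognize, assuming x and y do not alias:

    // Both subscripts are affine in the induction variable i (stride 1,
    // offset 0); recognizing this lets a dependence analysis prove the
    // iterations independent, so a loop vectorizer can emit wide loads,
    // a wide multiply-add, and wide stores.
    void saxpy(float *y, const float *x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }
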
> >
> > I look forward to hearing everyone's thoughts.
> >
> > -Hal
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
>