[LLVMdev] Vectorization: Next Steps
Carl-Philip Hänsch
cphaensch at googlemail.com
Tue Feb 14 14:51:57 PST 2012
That works. Thank you.
Will -vectorize become the default later?
2012/2/14 Hal Finkel <hfinkel at anl.gov>
> If you run with -vectorize instead of -bb-vectorize it will schedule the
> cleanup passes for you.
>
> -Hal
>
> Sent from my Verizon Wireless Droid
>
>
> -----Original message-----
>
> From: "Carl-Philip Hänsch" <cphaensch at googlemail.com>
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tue, Feb 14, 2012 16:10:28 GMT+00:00
> Subject: Re: [LLVMdev] Vectorization: Next Steps
>
> I tested the "restrict" keyword and it works well :)
>
> The generated code is a bunch of shufflevector instructions, but after a
> second -O3 pass, everything looks fine.
> This problem is described in my ML post "passes propose passes" and occurs
> here again. LLVM has so many great passes, but they cannot run again once
> the code has been simplified :(
> Maybe that's one more reason to tell the pass scheduler to redo some
> passes so that it finds all optimizations. The code really does simplify
> to what I expected.
>
> 2012/2/13 Hal Finkel <hfinkel at anl.gov>
>
>> On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
>> > I will test your suggestion, but I designed the test case to load the
>> > memory directly into <4 x float> registers. So there are absolutely
>> > no permutation, swizzle, or move operations. Maybe the heuristic
>> > should not only count the depth but also the surrounding load/store
>> > operations.
>>
>> I've attached two variants of your file, both of which vectorize as
>> you'd expect. The core difference between these and your original file
>> is that I added the 'restrict' keyword so that the compiler can assume
>> that the arrays don't alias (or, in the first case, I made them
>> globals). You also probably need to specify some alignment information;
>> otherwise the memory operations will be scalarized in codegen.
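>> Purely as an illustration (my sketch, not the two files Hal attached;
>> the 16-byte figure is an assumption based on the <4 x float> SSE
>> registers discussed earlier in the thread), the two kinds of change
>> might look like:
>>
>>   /* Variant 1: global, 16-byte-aligned arrays, so the compiler knows
>>    * both the alignment and that the arrays cannot alias. */
>>   float ga[4] __attribute__((aligned(16)));
>>   float gb[4] __attribute__((aligned(16)));
>>   float gc[4] __attribute__((aligned(16)));
>>
>>   void add_globals(void)
>>   {
>>       for (int i = 0; i < 4; ++i)
>>           gc[i] = ga[i] + gb[i];
>>   }
>>
>>   /* Variant 2: 'restrict'-qualified pointers, telling the compiler
>>    * that the three arrays do not overlap. */
>>   void add_restrict(float *restrict out, const float *restrict a,
>>                     const float *restrict b)
>>   {
>>       for (int i = 0; i < 4; ++i)
>>           out[i] = a[i] + b[i];
>>   }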
>>
>> -Hal
>>
>> >
>> > Are the load/store operations vectorized, too? (I designed the test
>> > case to completely fit the SSE registers)
>> >
>> > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>> > Carl-Philip,
>> >
>> > The reason that this does not vectorize is that it cannot vectorize
>> > the stores; this leaves only the mul-add chains (and some chains
>> > with loads), and they only have a depth of 2 (the threshold is 6).
>> >
>> > If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
>> > will vectorize. The reason the heuristic has such a large default
>> > value is to prevent cases where it costs more to permute all of the
>> > necessary values into and out of the vector registers than is saved
>> > by vectorizing. Does the code generated with
>> > -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
>> > code?
>> >
>> > The heuristic can certainly be improved, and these kinds of test
>> > cases are very important to that improvement process.
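>> >
>> > Purely to illustrate the depth point (my own sketch, not code from
>> > the attached test case): after unrolling, each output element of the
>> > 4x4 product is a single store fed by a short multiply/add chain,
>> > e.g.:
>> >
>> >   /* One output element of the unrolled matrix * vector product.
>> >    * Excluding the store, the multiply/add chain feeding it is
>> >    * shallow, well below a required chain depth of 6, which is why
>> >    * the default heuristic finds nothing worth pairing. */
>> >   void row0(const float *restrict m, const float *restrict v,
>> >             float *restrict out)
>> >   {
>> >       out[0] = m[0] * v[0] + m[1] * v[1] + m[2] * v[2] + m[3] * v[3];
>> >   }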
>> >
>> > -Hal
>> >
>> > On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
>> > > I have a super-simple test case, a 4x4 matrix * 4-vector product,
>> > > which gets correctly unrolled but is not vectorized by
>> > > -bb-vectorize (I used llvm 3.1svn).
>> > > I attached the test case so you can see what is going wrong there.
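>> > >
>> > > (The attachment itself is not reproduced in this archive page.
>> > > Purely as an illustration, a 4x4 matrix * 4-vector test case of
>> > > the kind described might look like the following sketch.)
>> > >
>> > >   /* Hypothetical reconstruction, not the actual attachment: a
>> > >    * 4x4 matrix times a 4-element vector, small enough that the
>> > >    * loop can be fully unrolled into one basic block. */
>> > >   void mat4_mul_vec4(const float m[4][4], const float v[4],
>> > >                      float out[4])
>> > >   {
>> > >       for (int i = 0; i < 4; ++i)
>> > >           out[i] = m[i][0] * v[0] + m[i][1] * v[1]
>> > >                  + m[i][2] * v[2] + m[i][3] * v[3];
>> > >   }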
>> > >
>> > > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
>> > > As some of you may know, I committed my basic-block
>> > > autovectorization pass a few days ago. I encourage anyone
>> > > interested to try it out (pass -vectorize to opt or -mllvm
>> > > -vectorize to clang) and provide feedback. Especially in
>> > > combination with -unroll-allow-partial, I have observed some
>> > > significant benchmark speedups, but I have also observed some
>> > > significant slowdowns. I would like to share my thoughts, and
>> > > hopefully get feedback, on next steps.
>> > >
>> > > 1. "Target Data" for vectorization - I think that in order to
>> > > improve the vectorization quality, the vectorizer will need more
>> > > information about the target. This information could be provided
>> > > in the form of a kind of extended target data. This extended
>> > > target data might contain (sketched below):
>> > >  - What basic types can be vectorized, and how many of them will
>> > >    fit into (the largest) vector registers
>> > >  - What classes of operations can be vectorized (division,
>> > >    conversions / sign extension, etc. are not always supported)
>> > >  - What alignment is necessary for loads and stores
>> > >  - Is scalar-to-vector free?
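>> > >
>> > > A rough sketch of what such a record might carry (illustration
>> > > only; the struct and field names are invented here and are not an
>> > > existing LLVM interface):
>> > >
>> > >   struct VectorTargetInfo {
>> > >       unsigned max_vector_bits;    /* size of the largest vector
>> > >                                       register, in bits */
>> > >       unsigned vectorizable_types; /* bitmask of basic types that
>> > >                                       can be vectorized */
>> > >       unsigned vectorizable_ops;   /* bitmask of operation classes:
>> > >                                       division, conversions, sign
>> > >                                       extension, ... */
>> > >       unsigned load_store_align;   /* alignment required for vector
>> > >                                       loads and stores, in bytes */
>> > >       int scalar_to_vector_free;   /* nonzero if scalar-to-vector
>> > >                                       insertion is free */
>> > >   };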
>> > >
>> > > 2. Feedback between passes - We may want to implement a closer
>> > > coupling between optimization passes than currently exists.
>> > > Specifically, I have in mind two things:
>> > >  - The vectorizer should communicate more closely with the loop
>> > >    unroller. First, the loop unroller should try to unroll to
>> > >    preserve maximal load/store alignments. Second, I think it
>> > >    would make a lot of sense to be able to unroll and then keep
>> > >    the unrolled version in preference to the original only if
>> > >    this helps vectorization. With basic-block vectorization, it
>> > >    is often necessary to (partially) unroll in order to vectorize
>> > >    (see the sketch after this list). Even when we also have real
>> > >    loop vectorization, however, I still think that it will be
>> > >    important for the loop unroller to communicate with the
>> > >    vectorizer.
>> > >  - After vectorization, it would make sense for the vectorization
>> > >    pass to request further simplification, but only on those
>> > >    parts of the code that it modified.
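>> > >
>> > > As a small illustration of the unroll-then-vectorize point above
>> > > (my sketch, assuming a 4-wide <4 x float> target):
>> > >
>> > >   /* Original loop: each iteration is one multiply and one store,
>> > >    * so a single basic block holds no packable parallel work. */
>> > >   void scale(float *restrict a, const float *restrict b,
>> > >              float s, int n)
>> > >   {
>> > >       for (int i = 0; i < n; ++i)
>> > >           a[i] = b[i] * s;
>> > >   }
>> > >
>> > >   /* Unrolled by four (assuming n is a multiple of 4): the four
>> > >    * independent multiplies and stores now sit in one basic block,
>> > >    * which is what a basic-block vectorizer can pack into a single
>> > >    * <4 x float> operation. */
>> > >   void scale_unrolled4(float *restrict a, const float *restrict b,
>> > >                        float s, int n)
>> > >   {
>> > >       for (int i = 0; i < n; i += 4) {
>> > >           a[i]     = b[i]     * s;
>> > >           a[i + 1] = b[i + 1] * s;
>> > >           a[i + 2] = b[i + 2] * s;
>> > >           a[i + 3] = b[i + 3] * s;
>> > >       }
>> > >   }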
>> > >
>> > > 3. Loop vectorization - It would be nice to have, in addition to
>> > > basic-block vectorization, a more-traditional loop vectorization
>> > > pass. I think that we'll need a better loop analysis pass in order
>> > > for this to happen. Some of this was started in
>> > > LoopDependenceAnalysis, but that pass is not yet finished. We'll
>> > > need something like this to recognize affine memory references,
>> > > etc.
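>> > >
>> > > For example (illustration only, with invented names): the
>> > > subscripts below are affine functions of the induction variable
>> > > (i and 2*i + 1), which is the kind of memory reference a finished
>> > > dependence analysis would need to recognize before the loop could
>> > > be vectorized safely:
>> > >
>> > >   void gather_affine(float *restrict dst,
>> > >                      const float *restrict src, int n)
>> > >   {
>> > >       for (int i = 0; i < n; ++i)
>> > >           dst[i] = src[2 * i + 1];
>> > >   }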
>> > >
>> > > I look forward to hearing everyone's thoughts.
>> > >
>> > > -Hal
>> > >
>> > > --
>> > > Hal Finkel
>> > > Postdoctoral Appointee
>> > > Leadership Computing Facility
>> > > Argonne National Laboratory
>> > >
>> > > _______________________________________________
>> > > LLVM Developers mailing list
>> > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> > >
>> >
>> > --
>> > Hal Finkel
>> > Postdoctoral Appointee
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>> >
>> >
>> >
>>
>> --
>> Hal Finkel
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>