[LLVMdev] Vectorization: Next Steps
Hal Finkel
hfinkel at anl.gov
Tue Feb 14 20:23:11 PST 2012
On Tue, 2012-02-14 at 23:51 +0100, Carl-Philip Hänsch wrote:
> That works. Thank you.
> Will -vectorize become default later?
I don't know, but I think there is a lot of improvement to be made first.
-Hal
>
> 2012/2/14 Hal Finkel <hfinkel at anl.gov>
> If you run with -vectorize instead of -bb-vectorize it will
> schedule the cleanup passes for you.
>
> -Hal
>
> Sent from my Verizon Wireless Droid
>
>
> -----Original message-----
> From: "Carl-Philip Hänsch" <cphaensch at googlemail.com>
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tue, Feb 14, 2012 16:10:28 GMT+00:00
>
> Subject: Re: [LLVMdev] Vectorization: Next Steps
>
>
> I tested the "restricted" keyword and it works well :)
>
> The generated code is a bunch of shufflevector
> instructions, but after a second -O3 pass, everything
> looks fine.
> This problem is described in my ML post "passes
> propose passes" and occurs here again. LLVM has so
> much great passes, but they cannot start again when
> the code was somewhat simplified :(
> Maybe that's one more reason to tell the pass
> scheduler to redo some passes to find all
> optimizations. The core really simplifies to what I
> expected.
>
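> (Concretely, the two-step cleanup described above amounts to something
> like
>
>     opt -vectorize input.ll -S -o vec.ll
>     opt -O3 vec.ll -S -o out.ll
>
> where the .ll file names are placeholders; -vectorize is the option
> discussed in this thread, and the second opt run plays the role of the
> "second -O3 pass" mentioned above.)
>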
> 2012/2/13 Hal Finkel <hfinkel at anl.gov>
> On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
> > I will test your suggestion, but I designed the test case to load the
> > memory directly into <4 x float> registers. So there is absolutely no
> > permutation or other swizzle or move operations. Maybe the heuristic
> > should not only count the depth but also the surrounding load/store
> > operations.
>
>
> I've attached two variants of your file, both of which vectorize as
> you'd expect. The core difference between these and your original file
> is that I added the 'restrict' keyword so that the compiler can assume
> that the arrays don't alias (or, in the first case, I made them
> globals). You also probably need to specify some alignment
> information; otherwise the memory operations will be scalarized in
> codegen.
>
> -Hal
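>
> (The attachments are not reproduced in this archive. A minimal sketch
> of the kind of kernel under discussion, with invented names, might
> look like this; 'restrict' lets the compiler assume the arrays don't
> alias, and the aligned(16) attribute supplies the alignment
> information mentioned above:)
>
>     /* Hypothetical 4x4 matrix * 4-vector test case (row-major). */
>     __attribute__((aligned(16))) float M[16], V[4], R[4];
>
>     void mat_vec_mul(float *restrict r, const float *restrict m,
>                      const float *restrict v)
>     {
>         for (int i = 0; i < 4; ++i)
>             r[i] = m[4*i + 0] * v[0] + m[4*i + 1] * v[1]
>                  + m[4*i + 2] * v[2] + m[4*i + 3] * v[3];
>     }
>
>     /* Called as mat_vec_mul(R, M, V); compiled, as in this thread,
>        with: clang -O3 -c -mllvm -vectorize matvec.c
>        (the file name is a placeholder). */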
>
> >
> > Are the load/store operations vectorized, too? (I designed the test
> > case to completely fit the SSE registers)
> >
> > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
> > Carl-Philip,
> >
> > The reason that this does not vectorize is that it cannot vectorize
> > the stores; this leaves only the mul-add chains (and some chains
> > with loads), and they only have a depth of 2 (the threshold is 6).
> >
> > If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
> > will vectorize. The reason the heuristic has such a large default
> > value is to prevent cases where it costs more to permute all of the
> > necessary values into and out of the vector registers than is saved
> > by vectorizing. Does the code generated with
> > -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
> > code?
> >
> > The heuristic can certainly be improved, and these kinds of test
> > cases are very important to that improvement process.
> >
> > -Hal
> >
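> > (As a concrete usage note, and assuming the same placeholder file
> > name as in the sketch above, lowering the threshold from the command
> > line looks like:
> >
> >     clang -O3 -c -mllvm -vectorize \
> >           -mllvm -bb-vectorize-req-chain-depth=2 matvec.c
> >
> > each -mllvm forwards one option to the LLVM optimizer.)
> >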
> > On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
> > > I have a super-simple test case (4x4 matrix * 4-vector) which gets
> > > correctly unrolled, but is not vectorized by -bb-vectorize. (I used
> > > llvm 3.1svn.) I attached the test case so you can see what is going
> > > wrong there.
> > >
> > > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
> > > As some of you may know, I committed my basic-block
> > > autovectorization pass a few days ago. I encourage anyone
> > > interested to try it out (pass -vectorize to opt or -mllvm
> > > -vectorize to clang) and provide feedback. Especially in
> > > combination with -unroll-allow-partial, I have observed some
> > > significant benchmark speedups, but I have also observed some
> > > significant slowdowns. I would like to share my thoughts, and
> > > hopefully get feedback, on next steps.
> > >
> > > 1. "Target Data" for
> vectorization - I think that in
> > order to
> > > improve
> > > the vectorization quality,
> the vectorizer will need
> > more
> > > information
> > > about the target. This
> information could be provided
> > in the
> > > form of a
> > > kind of extended target
> data. This extended target
> > data might
> > > contain:
> > > - What basic types can be
> vectorized, and how many
> > of them
> > > will fit
> > > into (the largest) vector
> registers
> > > - What classes of
> operations can be vectorized
> > (division,
> > > conversions /
> > > sign extension, etc. are
> not always supported)
> > > - What alignment is
> necessary for loads and stores
> > > - Is scalar-to-vector
> free?
> > >
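> > > (Purely as an illustration of the list above, not an existing LLVM
> > > interface; the field names are invented. Such a target description
> > > might be sketched in C as:)
> > >
> > >     /* Hypothetical per-target vectorization info. */
> > >     struct VectorTargetInfo {
> > >         unsigned widest_vector_bits;    /* e.g. 128 for SSE */
> > >         unsigned vectorizable_types;    /* bitmask: i8/i16/i32/f32/f64... */
> > >         unsigned vectorizable_ops;      /* bitmask: div, sext, fptoui, ... */
> > >         unsigned min_load_align;        /* required alignment, in bytes */
> > >         unsigned min_store_align;       /* required alignment, in bytes */
> > >         int      scalar_to_vector_free; /* nonzero if broadcast/insert is free */
> > >     };
> > >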
> > > 2. Feedback between passes - We may want to implement a closer
> > > coupling between optimization passes than currently exists.
> > > Specifically, I have in mind two things:
> > >  - The vectorizer should communicate more closely with the loop
> > >    unroller. First, the loop unroller should try to unroll to
> > >    preserve maximal load/store alignments. Second, I think it would
> > >    make a lot of sense to be able to unroll and then keep the
> > >    unrolled version in preference to the original only if this
> > >    helps vectorization. With basic-block vectorization, it is often
> > >    necessary to (partially) unroll in order to vectorize. Even when
> > >    we also have real loop vectorization, however, I still think
> > >    that it will be important for the loop unroller to communicate
> > >    with the vectorizer.
> > >  - After vectorization, it would make sense for the vectorization
> > >    pass to request further simplification, but only on those parts
> > >    of the code that it modified.
> > >
> > > 3. Loop vectorization - It would be nice to have, in addition to
> > > basic-block vectorization, a more-traditional loop vectorization
> > > pass. I think that we'll need a better loop analysis pass in order
> > > for this to happen. Some of this was started in
> > > LoopDependenceAnalysis, but that pass is not yet finished. We'll
> > > need something like this to recognize affine memory references,
> > > etc.
> > >
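> > > (As a concrete example of the kind of loop such a pass would
> > > target, and with invented names, consider a simple kernel whose
> > > subscripts are affine in the induction variable, so a dependence
> > > analysis like the one described above can reason about it:)
> > >
> > >     /* The subscripts a[i], b[i], c[i] are affine in the counter i. */
> > >     void saxpy_like(float *restrict a, const float *restrict b,
> > >                     const float *restrict c, float s, int n)
> > >     {
> > >         for (int i = 0; i < n; ++i)
> > >             a[i] = s * b[i] + c[i];
> > >     }
> > >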
> > > I look forward to hearing everyone's thoughts.
> > >
> > > -Hal
> > >
> > > --
> > > Hal Finkel
> > > Postdoctoral Appointee
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
> >
>
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov