[LLVMdev] Vectorization: Next Steps
Hal Finkel
hfinkel at anl.gov
Tue Feb 14 20:23:11 PST 2012
On Tue, 2012-02-14 at 23:51 +0100, Carl-Philip Hänsch wrote:
> That works. Thank you.
> Will -vectorize become default later?
I don't know, but I think there is a lot of improvement to be made first.
-Hal
>
> 2012/2/14 Hal Finkel <hfinkel at anl.gov>
> If you run with -vectorize instead of -bb-vectorize it will
> schedule the cleanup passes for you.
>
> -Hal
>
> Sent from my Verizon Wireless Droid
>
>
> -----Original message-----
> From: "Carl-Philip Hänsch" <cphaensch at googlemail.com>
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tue, Feb 14, 2012 16:10:28 GMT+00:00
>
> Subject: Re: [LLVMdev] Vectorization: Next Steps
>
>
> I tested the "restricted" keyword and it works well :)
>
> The generated code is a bunch of shufflevector
> instructions, but after a second -O3 pass, everything
> looks fine.
> This problem is described in my ML post "passes
> propose passes" and occurs here again. LLVM has so
> much great passes, but they cannot start again when
> the code was somewhat simplified :(
> Maybe that's one more reason to tell the pass
> scheduler to redo some passes to find all
> optimizations. The core really simplifies to what I
> expected.
>
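> (Concretely, the two-step cleanup described above amounts to something
> like
>
>     opt -vectorize input.ll -S -o vec.ll
>     opt -O3 vec.ll -S -o out.ll
>
> where the .ll file names are placeholders; -vectorize is the option
> discussed in this thread, and the second opt run plays the role of the
> "second -O3 pass" mentioned above.)
>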
> 2012/2/13 Hal Finkel <hfinkel at anl.gov>
> On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
> > I will test your suggestion, but I designed the test case to load the
> > memory directly into <4 x float> registers. So there is absolutely no
> > permutation or other swizzle or move operations. Maybe the heuristic
> > should not only count the depth but also the surrounding load/store
> > operations.
>
>
> I've attached two variants of your file, both of which vectorize as
> you'd expect. The core difference between these and your original file
> is that I added the 'restrict' keyword so that the compiler can assume
> that the arrays don't alias (or, in the first case, I made them
> globals). You also probably need to specify some alignment
> information; otherwise the memory operations will be scalarized in
> codegen.
>
> -Hal
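>
> (The attachments are not reproduced in this archive. A minimal sketch
> of the kind of kernel under discussion, with invented names, might
> look like this; 'restrict' lets the compiler assume the arrays don't
> alias, and the aligned(16) attribute supplies the alignment
> information mentioned above:)
>
>     /* Hypothetical 4x4 matrix * 4-vector test case (row-major). */
>     __attribute__((aligned(16))) float M[16], V[4], R[4];
>
>     void mat_vec_mul(float *restrict r, const float *restrict m,
>                      const float *restrict v)
>     {
>         for (int i = 0; i < 4; ++i)
>             r[i] = m[4*i + 0] * v[0] + m[4*i + 1] * v[1]
>                  + m[4*i + 2] * v[2] + m[4*i + 3] * v[3];
>     }
>
>     /* Called as mat_vec_mul(R, M, V); compiled, as in this thread,
>        with: clang -O3 -c -mllvm -vectorize matvec.c
>        (the file name is a placeholder). */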
>
> >
> > Are the load/store operations vectorized, too? (I designed the test
> > case to completely fit the SSE registers)
> >
> > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
> > Carl-Philip,
> >
> > The reason that this does not vectorize is that it cannot vectorize
> > the stores; this leaves only the mul-add chains (and some chains
> > with loads), and they only have a depth of 2 (the threshold is 6).
> >
> > If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
> > will vectorize. The reason the heuristic has such a large default
> > value is to prevent cases where it costs more to permute all of the
> > necessary values into and out of the vector registers than is saved
> > by vectorizing. Does the code generated with
> > -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
> > code?
> >
> > The heuristic can certainly be improved, and these kinds of test
> > cases are very important to that improvement process.
> >
> > -Hal
> >
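> > (As a concrete usage note, and assuming the same placeholder file
> > name as in the sketch above, lowering the threshold from the command
> > line looks like:
> >
> >     clang -O3 -c -mllvm -vectorize \
> >           -mllvm -bb-vectorize-req-chain-depth=2 matvec.c
> >
> > each -mllvm forwards one option to the LLVM optimizer.)
> >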
> > On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
> > > I have a super-simple test case (4x4 matrix * 4-vector) which gets
> > > correctly unrolled, but is not vectorized by -bb-vectorize. (I used
> > > llvm 3.1svn.) I attached the test case so you can see what is going
> > > wrong there.
> > >
> > > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
> > > As some of you may know, I committed my basic-block
> > > autovectorization pass a few days ago. I encourage anyone
> > > interested to try it out (pass -vectorize to opt or -mllvm
> > > -vectorize to clang) and provide feedback. Especially in
> > > combination with -unroll-allow-partial, I have observed some
> > > significant benchmark speedups, but I have also observed some
> > > significant slowdowns. I would like to share my thoughts, and
> > > hopefully get feedback, on next steps.
> > >
> > > 1. "Target Data" for
> vectorization - I think that in
> > order to
> > > improve
> > > the vectorization quality,
> the vectorizer will need
> > more
> > > information
> > > about the target. This
> information could be provided
> > in the
> > > form of a
> > > kind of extended target
> data. This extended target
> > data might
> > > contain:
> > > - What basic types can be
> vectorized, and how many
> > of them
> > > will fit
> > > into (the largest) vector
> registers
> > > - What classes of
> operations can be vectorized
> > (division,
> > > conversions /
> > > sign extension, etc. are
> not always supported)
> > > - What alignment is
> necessary for loads and stores
> > > - Is scalar-to-vector
> free?
> > >
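> > > (Purely as an illustration of the list above, not an existing LLVM
> > > interface; the field names are invented. Such a target description
> > > might be sketched in C as:)
> > >
> > >     /* Hypothetical per-target vectorization info. */
> > >     struct VectorTargetInfo {
> > >         unsigned widest_vector_bits;    /* e.g. 128 for SSE */
> > >         unsigned vectorizable_types;    /* bitmask: i8/i16/i32/f32/f64... */
> > >         unsigned vectorizable_ops;      /* bitmask: div, sext, fptoui, ... */
> > >         unsigned min_load_align;        /* required alignment, in bytes */
> > >         unsigned min_store_align;       /* required alignment, in bytes */
> > >         int      scalar_to_vector_free; /* nonzero if broadcast/insert is free */
> > >     };
> > >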
> > > 2. Feedback between passes - We may want to implement a closer
> > > coupling between optimization passes than currently exists.
> > > Specifically, I have in mind two things:
> > >  - The vectorizer should communicate more closely with the loop
> > >    unroller. First, the loop unroller should try to unroll to
> > >    preserve maximal load/store alignments. Second, I think it would
> > >    make a lot of sense to be able to unroll and then keep the
> > >    unrolled version in preference to the original only if this
> > >    helps vectorization. With basic-block vectorization, it is often
> > >    necessary to (partially) unroll in order to vectorize. Even when
> > >    we also have real loop vectorization, however, I still think
> > >    that it will be important for the loop unroller to communicate
> > >    with the vectorizer.
> > >  - After vectorization, it would make sense for the vectorization
> > >    pass to request further simplification, but only on those parts
> > >    of the code that it modified.
> > >
> > > 3. Loop vectorization - It would be nice to have, in addition to
> > > basic-block vectorization, a more-traditional loop vectorization
> > > pass. I think that we'll need a better loop analysis pass in order
> > > for this to happen. Some of this was started in
> > > LoopDependenceAnalysis, but that pass is not yet finished. We'll
> > > need something like this to recognize affine memory references,
> > > etc.
> > >
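> > > (As a concrete example of the kind of loop such a pass would
> > > target, and with invented names, consider a simple kernel whose
> > > subscripts are affine in the induction variable, so a dependence
> > > analysis like the one described above can reason about it:)
> > >
> > >     /* The subscripts a[i], b[i], c[i] are affine in the counter i. */
> > >     void saxpy_like(float *restrict a, const float *restrict b,
> > >                     const float *restrict c, float s, int n)
> > >     {
> > >         for (int i = 0; i < n; ++i)
> > >             a[i] = s * b[i] + c[i];
> > >     }
> > >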
> > > I look forward to hearing everyone's thoughts.
> > >
> > > -Hal
> > >
> > > --
> > > Hal Finkel
> > > Postdoctoral Appointee
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
> >
>
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov