[LLVMdev] Vectorization: Next Steps
Carl-Philip Hänsch
cphaensch at googlemail.com
Tue Feb 14 14:51:57 PST 2012
That works. Thank you.
Will -vectorize become the default later?
2012/2/14 Hal Finkel <hfinkel at anl.gov>
> If you run with -vectorize instead of -bb-vectorize it will schedule the
> cleanup passes for you.
>
> -Hal
>
> Sent from my Verizon Wireless Droid
>
>
> -----Original message-----
>
> From: "Carl-Philip Hänsch" <cphaensch at googlemail.com>
> To: Hal Finkel <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tue, Feb 14, 2012 16:10:28 GMT+00:00
> Subject: Re: [LLVMdev] Vectorization: Next Steps
>
> I tested the "restrict" keyword and it works well :)
>
> The generated code is a bunch of shufflevector instructions, but after a
> second -O3 pass, everything looks fine.
> This problem is described in my ML post "passes propose passes" and occurs
> here again. LLVM has so many great passes, but they cannot run again once
> the code has been simplified :(
> Maybe that's one more reason to tell the pass scheduler to redo some
> passes so that it finds all optimizations. The code really does simplify
> to what I expected.
>
> 2012/2/13 Hal Finkel <hfinkel at anl.gov>
>
>> On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
>> > I will test your suggestion, but I designed the test case to load the
>> > memory directly into <4 x float> registers. So there are absolutely
>> > no permutation, swizzle, or move operations. Maybe the heuristic
>> > should not only count the depth but also the surrounding load/store
>> > operations.
>>
>> I've attached two variants of your file, both of which vectorize as
>> you'd expect. The core difference between these and your original file
>> is that I added the 'restrict' keyword so that the compiler can assume
>> that the arrays don't alias (or, in the first case, I made them
>> globals). You also probably need to specify some alignment information;
>> otherwise the memory operations will be scalarized in codegen.
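>> Purely as an illustration (my sketch, not the two files Hal attached;
>> the 16-byte figure is an assumption based on the <4 x float> SSE
>> registers discussed earlier in the thread), the two kinds of change
>> might look like:
>>
>>   /* Variant 1: global, 16-byte-aligned arrays, so the compiler knows
>>    * both the alignment and that the arrays cannot alias. */
>>   float ga[4] __attribute__((aligned(16)));
>>   float gb[4] __attribute__((aligned(16)));
>>   float gc[4] __attribute__((aligned(16)));
>>
>>   void add_globals(void)
>>   {
>>       for (int i = 0; i < 4; ++i)
>>           gc[i] = ga[i] + gb[i];
>>   }
>>
>>   /* Variant 2: 'restrict'-qualified pointers, telling the compiler
>>    * that the three arrays do not overlap. */
>>   void add_restrict(float *restrict out, const float *restrict a,
>>                     const float *restrict b)
>>   {
>>       for (int i = 0; i < 4; ++i)
>>           out[i] = a[i] + b[i];
>>   }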
>>
>> -Hal
>>
>> >
>> > Are the load/store operations vectorized, too? (I designed the test
>> > case to completely fit the SSE registers)
>> >
>> > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>> > Carl-Philip,
>> >
>> > The reason that this does not vectorize is that it cannot vectorize
>> > the stores; this leaves only the mul-add chains (and some chains
>> > with loads), and they only have a depth of 2 (the threshold is 6).
>> >
>> > If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
>> > will vectorize. The reason the heuristic has such a large default
>> > value is to prevent cases where it costs more to permute all of the
>> > necessary values into and out of the vector registers than is saved
>> > by vectorizing. Does the code generated with
>> > -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
>> > code?
>> >
>> > The heuristic can certainly be improved, and these kinds of test
>> > cases are very important to that improvement process.
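>> >
>> > Purely to illustrate the depth point (my own sketch, not code from
>> > the attached test case): after unrolling, each output element of the
>> > 4x4 product is a single store fed by a short multiply/add chain,
>> > e.g.:
>> >
>> >   /* One output element of the unrolled matrix * vector product.
>> >    * Excluding the store, the multiply/add chain feeding it is
>> >    * shallow, well below a required chain depth of 6, which is why
>> >    * the default heuristic finds nothing worth pairing. */
>> >   void row0(const float *restrict m, const float *restrict v,
>> >             float *restrict out)
>> >   {
>> >       out[0] = m[0] * v[0] + m[1] * v[1] + m[2] * v[2] + m[3] * v[3];
>> >   }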
>> >
>> > -Hal
>> >
>> > On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
>> > > I have a super-simple test case, a 4x4 matrix * 4-vector product,
>> > > which gets correctly unrolled but is not vectorized by
>> > > -bb-vectorize (I used llvm 3.1svn).
>> > > I attached the test case so you can see what is going wrong there.
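>> > >
>> > > (The attachment itself is not reproduced in this archive page.
>> > > Purely as an illustration, a 4x4 matrix * 4-vector test case of
>> > > the kind described might look like the following sketch.)
>> > >
>> > >   /* Hypothetical reconstruction, not the actual attachment: a
>> > >    * 4x4 matrix times a 4-element vector, small enough that the
>> > >    * loop can be fully unrolled into one basic block. */
>> > >   void mat4_mul_vec4(const float m[4][4], const float v[4],
>> > >                      float out[4])
>> > >   {
>> > >       for (int i = 0; i < 4; ++i)
>> > >           out[i] = m[i][0] * v[0] + m[i][1] * v[1]
>> > >                  + m[i][2] * v[2] + m[i][3] * v[3];
>> > >   }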
>> > >
>> > > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
>> > > As some of you may know, I committed my basic-block
>> > > autovectorization pass a few days ago. I encourage anyone
>> > > interested to try it out (pass -vectorize to opt or -mllvm
>> > > -vectorize to clang) and provide feedback. Especially in
>> > > combination with -unroll-allow-partial, I have observed some
>> > > significant benchmark speedups, but I have also observed some
>> > > significant slowdowns. I would like to share my thoughts, and
>> > > hopefully get feedback, on next steps.
>> > >
>> > > 1. "Target Data" for vectorization - I think that in order to
>> > > improve the vectorization quality, the vectorizer will need more
>> > > information about the target. This information could be provided
>> > > in the form of a kind of extended target data. This extended
>> > > target data might contain (sketched below):
>> > >  - What basic types can be vectorized, and how many of them will
>> > >    fit into (the largest) vector registers
>> > >  - What classes of operations can be vectorized (division,
>> > >    conversions / sign extension, etc. are not always supported)
>> > >  - What alignment is necessary for loads and stores
>> > >  - Is scalar-to-vector free?
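>> > >
>> > > A rough sketch of what such a record might carry (illustration
>> > > only; the struct and field names are invented here and are not an
>> > > existing LLVM interface):
>> > >
>> > >   struct VectorTargetInfo {
>> > >       unsigned max_vector_bits;    /* size of the largest vector
>> > >                                       register, in bits */
>> > >       unsigned vectorizable_types; /* bitmask of basic types that
>> > >                                       can be vectorized */
>> > >       unsigned vectorizable_ops;   /* bitmask of operation classes:
>> > >                                       division, conversions, sign
>> > >                                       extension, ... */
>> > >       unsigned load_store_align;   /* alignment required for vector
>> > >                                       loads and stores, in bytes */
>> > >       int scalar_to_vector_free;   /* nonzero if scalar-to-vector
>> > >                                       insertion is free */
>> > >   };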
>> > >
>> > > 2. Feedback between passes - We may want to implement a closer
>> > > coupling between optimization passes than currently exists.
>> > > Specifically, I have in mind two things:
>> > >  - The vectorizer should communicate more closely with the loop
>> > >    unroller. First, the loop unroller should try to unroll to
>> > >    preserve maximal load/store alignments. Second, I think it
>> > >    would make a lot of sense to be able to unroll and then keep
>> > >    the unrolled version in preference to the original only if
>> > >    this helps vectorization. With basic-block vectorization, it
>> > >    is often necessary to (partially) unroll in order to vectorize
>> > >    (see the sketch after this list). Even when we also have real
>> > >    loop vectorization, however, I still think that it will be
>> > >    important for the loop unroller to communicate with the
>> > >    vectorizer.
>> > >  - After vectorization, it would make sense for the vectorization
>> > >    pass to request further simplification, but only on those
>> > >    parts of the code that it modified.
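>> > >
>> > > As a small illustration of the unroll-then-vectorize point above
>> > > (my sketch, assuming a 4-wide <4 x float> target):
>> > >
>> > >   /* Original loop: each iteration is one multiply and one store,
>> > >    * so a single basic block holds no packable parallel work. */
>> > >   void scale(float *restrict a, const float *restrict b,
>> > >              float s, int n)
>> > >   {
>> > >       for (int i = 0; i < n; ++i)
>> > >           a[i] = b[i] * s;
>> > >   }
>> > >
>> > >   /* Unrolled by four (assuming n is a multiple of 4): the four
>> > >    * independent multiplies and stores now sit in one basic block,
>> > >    * which is what a basic-block vectorizer can pack into a single
>> > >    * <4 x float> operation. */
>> > >   void scale_unrolled4(float *restrict a, const float *restrict b,
>> > >                        float s, int n)
>> > >   {
>> > >       for (int i = 0; i < n; i += 4) {
>> > >           a[i]     = b[i]     * s;
>> > >           a[i + 1] = b[i + 1] * s;
>> > >           a[i + 2] = b[i + 2] * s;
>> > >           a[i + 3] = b[i + 3] * s;
>> > >       }
>> > >   }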
>> > >
>> > > 3. Loop vectorization - It would be nice to have, in addition to
>> > > basic-block vectorization, a more-traditional loop vectorization
>> > > pass. I think that we'll need a better loop analysis pass in order
>> > > for this to happen. Some of this was started in
>> > > LoopDependenceAnalysis, but that pass is not yet finished. We'll
>> > > need something like this to recognize affine memory references,
>> > > etc.
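>> > >
>> > > For example (illustration only, with invented names): the
>> > > subscripts below are affine functions of the induction variable
>> > > (i and 2*i + 1), which is the kind of memory reference a finished
>> > > dependence analysis would need to recognize before the loop could
>> > > be vectorized safely:
>> > >
>> > >   void gather_affine(float *restrict dst,
>> > >                      const float *restrict src, int n)
>> > >   {
>> > >       for (int i = 0; i < n; ++i)
>> > >           dst[i] = src[2 * i + 1];
>> > >   }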
>> > >
>> > > I look forward to hearing everyone's thoughts.
>> > >
>> > > -Hal
>> > >
>> > > --
>> > > Hal Finkel
>> > > Postdoctoral Appointee
>> > > Leadership Computing Facility
>> > > Argonne National Laboratory
>> > >
>> > > _______________________________________________
>> > > LLVM Developers mailing list
>> > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> > >
>> >
>> > --
>> > Hal Finkel
>> > Postdoctoral Appointee
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>> >
>> >
>> >
>>
>> --
>> Hal Finkel
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>