[PATCH] Make the SLP vectorizer consider the cost when vectorized instructions cannot use a memory operand as the destination on X86

Eric Christopher echristo at gmail.com
Thu Jun 11 14:30:37 PDT 2015


On Thu, Jun 11, 2015 at 2:27 PM Xinliang David Li <davidxl at google.com>
wrote:

> On Thu, Jun 11, 2015 at 2:19 PM, Arnold Schwaighofer
> <aschwaighofer at apple.com> wrote:
> >
> >> On Jun 11, 2015, at 2:02 PM, Xinliang David Li <davidxl at google.com>
> wrote:
> >>
> >>>
> >>> The “best solution” would be to send the vectorized code to the
> backend and ask whether it is better or worse - we don’t want to do this
> because it is compile time expensive.
> >>
> >> Why is it compile time expensive?
> >
> > Rerunning parts of the pipeline (at least CodeGenPrepare) and ISel to
> MachineIR for every vectorized variant and the scalar variant is expensive.
> Realistically you would want to run more of the LLVM IR pipeline to clean up
> the vector code to get a somewhat accurate estimate.
>
> It may not be so expensive relatively speaking, though. Compile time these
> days is mostly spent in parsing.


We can't necessarily take this as a valid assumption. There are a lot of
uses of LLVM outside of traditional compilation that'll want vectorization
(e.g. run-time compilation of GPU shaders, etc.).

-eric


> David
>
>
> >
> >
> >>
> >> David
> >>
> >>
> >>> So we have solutions that are going to be imprecise, and the question
> is how complex we want to make our modeling.
> >>>
> >>>
> >>>>>> 2) "We don’t want TTI to expose an API that will be the superset of
> all target-specific information …”: What belongs to the TTI and what
> doesn’t? Can’t a target decide for itself?
> >>>
> >>> I think what that means is that we don’t want to expose every target's
> idiosyncrasies at the LLVM IR level. There might of course be room for
> certain things. Again, any proposal has to be evaluated individually.
> >>>
> >>>>>> 3) " implement a peephole to reverse vectorization if needed, in
> the x86 backend”: that amounts to “Work. Check if that work should have
> been done. If not undo that work and do it better.” and is certainly more
> expensive than querying a cost function. And this is both harder and a
> different concept than possibly splitting memory ops late in the pipeline.
> >>>
> >>> Frankly, I believe the cost is negligible and not really a concern
> here.
> >>>
> >>> This would probably be a simple ISel pattern family; what is complex
> about this?
> >>>
> >>> (store (shr/… (load <2 x i32> )) addr) -> scalar version ...
> >>>
> >>> If we can be more precise with reasonable complexity, I am all for it.
> Replicating ISel should be a non-goal in my opinion.
> >>>
> >>> In this concrete instance we might decide to make a shr with a (small)
> constant more expensive for vectors of two i64 (and maybe other
> combinations) so as not to pessimize scalar code. Hard to say; I have not
> done a thorough analysis. Making such a change requires analysis of a set
> of micro-kernels, the architecture and so on … all the suggestions that I
> have made are based on incomplete information.
> >>>
> >>>
> >>> What are you proposing? Only a concrete solution can be evaluated.
> >>>
> >>>
> >>>> On Jun 11, 2015, at 11:35 AM, Gerolf Hoflehner <ghoflehner at apple.com>
> wrote:
> >>>>
> >>>> When the analysis is off, nothing is to be done. That should be easy
> to verify.
> >>>>
> >>>> Yes, there are plenty of problems with the patch itself. But all that
> comes later and is orthogonal to the big-picture questions I'm after.
> >>>>
> >>>> Gerolf
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>>> On Jun 10, 2015, at 8:05 PM, Arnold <aschwaighofer at apple.com> wrote:
> >>>>>
> >>>>> Are we even sure we have the right explanation of why the scalar
> version is faster and that we would be modeling the right thing here? Sure,
> the Intel ISA has two memory ops for scalar insts, but when that gets
> translated by the processor's front end I would expect it to be translated
> into similar micro-ops as the vectorized version (other than the obvious
> difference). One can consult Agner's instruction manuals to get these
> values ...
> >>>>>
> >>>>> A quick glance at Haswell:
> >>>>>
> >>>>> 4 x shr m,i (4 unfused uops each) = 16 uops
> >>>>>
> >>>>> vs.
> >>>>>
> >>>>> 1 x movdqa m,x (2 uops) +
> >>>>> 1 x psrlq (2 uops) +
> >>>>> 1 x movdqa x,m (1 uop) = 5 uops
> >>>>>
> >>>>> Now we would have to look at execution ports, what else is executing,
> etc.
> >>>>>
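(For concreteness, here is a minimal C++ sketch of the kind of kernel being
compared above; the 64-bit element type, the four-element count, and the
shift amount of 5 are assumptions pieced together from the examples in this
thread rather than taken from the actual benchmark in the bug report.)

  #include <cstdint>

  // Four adjacent elements, each shifted right by a small constant in place.
  // Scalar lowering: four memory-destination shifts (shr qword ptr [mem], 5),
  // with no separate load/store instructions.
  // SLP-vectorized lowering: an explicit vector load, a vector shift, and a
  // vector store, since the vector shift cannot take a memory destination.
  void shift_block(uint64_t *p) {
    p[0] >>= 5;
    p[1] >>= 5;
    p[2] >>= 5;
    p[3] >>= 5;
  }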
> >>>>> Is the scalar version always faster, or only in this benchmark's
> context? Does the behavior depend on other instructions executing
> concurrently?
> >>>>>
> >>>>> Why did we choose a cost of two times the vector length? Would this
> be the right constant on all targets? How does that relate to the actual
> execution on an out-of-order target processor? Is the throughput half?
> >>>>>
> >>>>> We are never going to estimate the cost precisely without lowering
> to machine instructions and simulating the processor (for example, we do
> this in the machine combiner or the machine if-converter, where we simulate
> throughput and the critical path somewhat realistically).
> >>>>>
> >>>>> So more 'precision' along one dimension at the LLVM IR level might
> only mean a different lie, but at the cost of added complexity. This is a
> trade-off.
> >>>>>
> >>>>> I am not convinced that enough evidence has been established that
> the memory operand is the issue here and that the model suggested is the
> right one.
> >>>>>
> >>>>> I also don't think that we want to implement part of target
> lowering's logic in the vectorizer. Sure, this one example does not add
> much, but there would be more ...
> >>>>>
> >>>>> Another way of looking at this: if the scalar version is always
> better, should we scalarize irrespective of whether the code came out of
> the vectorizer or was written some other way (clang vector intrinsics)?
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On Jun 10, 2015, at 6:13 PM, Gerolf Hoflehner <ghoflehner at apple.com>
> wrote:
> >>>>>>
> >>>>>> Hi Nadav,
> >>>>>>
> >>>>>> I’d like to dig into this a bit more. It relates to the question of
> what the cost model for vectorizers should look like.
> >>>>>> 1) "the vectorizers should not model things like instruction
> encodings because it can complicate the vectorizer and TTI significantly”:
> You point to some HW-specific assumptions and this is one more. From the
> code I don’t see that it increases complexity - at least not significantly.
> And a target may decide that a more expensive cost analysis is worth the
> effort. The general follow-up questions are: Where is the line between what
> the cost model should take into account and what it shouldn’t? And when is
> it “too complex”?
> >>>>>> 2) "We don’t want TTI to expose an API that will be the superset of
> all target-specific information …”: What belongs to the TTI and what
> doesn’t? Can’t a target decide for itself?
> >>>>>> 3) " implement a peephole to reverse vectorization if needed, in
> the x86 backend”: that amounts to “Work. Check if that work should have
> been done. If not, undo that work and do it better.” and is certainly more
> expensive than querying a cost function. And this is both harder and a
> different concept than possibly splitting memory ops late in the pipeline.
> >>>>>>
> >>>>>> Thanks
> >>>>>> Gerolf
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Jun 10, 2015, at 9:42 AM, Nadav Rotem <nrotem at apple.com> wrote:
> >>>>>>>
> >>>>>>> Hi Wei,
> >>>>>>>
> >>>>>>> Thank you for working on this. The SLP and Loop vectorizers (and
> the target transform info) can’t model the entire complexity of the
> different targets. The vectorizers do not model the cost of addressing
> modes and assume that folding memory operations into instructions is only a
> code size reduction that does not influence runtime performance, because
> out-of-order processors translate these instructions into multiple uops
> that perform the load/store and the arithmetic operation. This is only an
> approximation of the way out-of-order processors work, but we need to make
> such assumptions if we want to limit the complexity of the vectorizer. One
> exception to this rule is the special handling of vector GEPs and
> scatter/gather instructions.
> >>>>>>>
> >>>>>>> Your example demonstrates that the assumption that we make about
> how out-of-order processors work is not always true. However, I still
> believe that the vectorizers should not model things like instruction
> encodings because it can complicate the vectorizer and TTI significantly. I
> believe that a better place to make a decision about the optimal code
> sequence in your example would be SelectionDAG. The codegen has more
> information to make such a decision. We don’t want TTI to expose an API
> that will be the superset of all target-specific information that any pass
> may care about.
> >>>>>>>
> >>>>>>> I suggest that we keep the current vectorizer cost model and
> implement a peephole to reverse vectorization if needed, in the x86
> backend. We already do something similar for wide AVX loads. On Sandy
> Bridge it is often beneficial to split 256-bit loads/stores, and the
> decision to split such loads is done in the codegen and not inside the
> vectorizer.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Nadav
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Jun 9, 2015, at 5:33 PM, Wei Mi <wmi at google.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi nadav, aschwaighofer,
> >>>>>>>>
> >>>>>>>> This is the patch to fix the performance problem reported in
> https://llvm.org/bugs/show_bug.cgi?id=23510.
> >>>>>>>>
> >>>>>>>> Many X86 scalar instructions support using a memory operand as the
> destination, but most vector instructions do not. In the SLP cost
> evaluation, consider:
> >>>>>>>>
> >>>>>>>> scalar version:
> >>>>>>>> t1 = load [mem1];
> >>>>>>>> t1 = shift 5, t1
> >>>>>>>> store t1, [mem1]
> >>>>>>>> ...
> >>>>>>>> t4 = load [mem4];
> >>>>>>>> t4 = shift 5, t4
> >>>>>>>> store t4, [mem4]
> >>>>>>>>
> >>>>>>>> slp vectorized version:
> >>>>>>>> v1 = vload [mem1];
> >>>>>>>> v1 = vshift 5, v1
> >>>>>>>> store v1, [mem1]
> >>>>>>>>
> >>>>>>>> The SLP cost model thinks there will be a saving of 12 - 3 = 9
> instructions. But the scalar version can be converted to the following form
> on x86, while the vectorized instructions cannot:
> >>>>>>>>
> >>>>>>>> [mem1] = shift 5, [mem1]
> >>>>>>>> [mem2] = shift 5, [mem2]
> >>>>>>>> [mem3] = shift 5, [mem3]
> >>>>>>>> [mem4] = shift 5, [mem4]
> >>>>>>>>
> >>>>>>>> We add the extra cost VL * 2 to the SLP cost evaluation to handle
> such cases (VL is the vector length).
> >>>>>>>>
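(As a rough illustration of how the proposed adjustment enters the
scalar-vs-vector comparison, here is a small self-contained C++ sketch; the
function and parameter names are invented for this sketch and do not
correspond to the actual patch.)

  // Hypothetical sketch of the SLP profitability comparison with the extra
  // VL * 2 term described above. Costs are abstract units from the target's
  // cost model; all names here are illustrative only.
  bool vectorizationLooksProfitable(unsigned VL, int ScalarOpCost,
                                    int VectorOpCost, int LoadStoreCost,
                                    bool ScalarOpCanUseMemoryDestination) {
    // Per scalar lane: one load, one op, one store.
    int ScalarCost = VL * (ScalarOpCost + 2 * LoadStoreCost);
    // Vectorized tree: one vector load, one vector op, one vector store.
    int VectorCost = VectorOpCost + 2 * LoadStoreCost;
    // Proposed adjustment: if the scalar op can fold the memory operand as
    // its destination, charge the vectorized version an extra VL * 2.
    if (ScalarOpCanUseMemoryDestination)
      VectorCost += VL * 2;
    return VectorCost < ScalarCost;
  }

(With unit costs and VL = 4, as in the example above, the unadjusted
comparison is 3 vs. 12; the adjustment raises the vector side by 8.)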
> >>>>>>>> REPOSITORY
> >>>>>>>> rL LLVM
> >>>>>>>>
> >>>>>>>> http://reviews.llvm.org/D10352
> >>>>>>>>
> >>>>>>>> Files:
> >>>>>>>> include/llvm/Analysis/TargetTransformInfo.h
> >>>>>>>> include/llvm/Analysis/TargetTransformInfoImpl.h
> >>>>>>>> lib/Analysis/TargetTransformInfo.cpp
> >>>>>>>> lib/Target/X86/X86TargetTransformInfo.cpp
> >>>>>>>> lib/Target/X86/X86TargetTransformInfo.h
> >>>>>>>> lib/Transforms/Vectorize/SLPVectorizer.cpp
> >>>>>>>> test/Transforms/SLPVectorizer/X86/pr23510.ll
> >>>>>>>>
> >>>>>>>> <D10352.27417.patch>
> >>>>>>>
> >>>>>>>
> >>>
> >
>