[PATCH] Make SLP vectorizer consider the cost of vectorized instructions not being able to use a memory operand as the destination on X86

Xinliang David Li davidxl at google.com
Thu Jun 11 14:34:24 PDT 2015


On Thu, Jun 11, 2015 at 2:30 PM, Eric Christopher <echristo at gmail.com> wrote:
>
>
> On Thu, Jun 11, 2015 at 2:27 PM Xinliang David Li <davidxl at google.com>
> wrote:
>>
>> On Thu, Jun 11, 2015 at 2:19 PM, Arnold Schwaighofer
>> <aschwaighofer at apple.com> wrote:
>> >
>> >> On Jun 11, 2015, at 2:02 PM, Xinliang David Li <davidxl at google.com>
>> >> wrote:
>> >>
>> >>>
>> >>> The “best solution” would be to send the vectorized code to the
>> >>> backend and ask whether it is better or worse. We don’t want to do this
>> >>> because it is compile-time expensive.
>> >>
>> >> Why is it compile-time expensive?
>> >
>> > Rerunning parts of the pipeline (at least CodeGenPrepare) and ISel to
>> > MachineIR for every vectorized variant and the scalar variant is
>> > expensive. Realistically you would want to run even more of the LLVM IR
>> > pipeline to clean up the vector code and get a somewhat accurate
>> > estimate.
>>
>> It may not be that expensive in relative terms, though. Compile time
>> these days is mostly spent in parsing.
>
>
> We can't necessarily take this as a valid assumption. There are a lot of
> uses of LLVM outside of traditional compilation that will want vectorization
> (e.g. run-time compilation of GPU shaders).

Fair enough. So perhaps differentiate this with two modes: one with a more
precise cost model and one with a lightweight model?
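
Purely as an illustration of what I mean (this enum and these hooks are
hypothetical, not an existing TTI API):

enum class CostModelPrecision { Lightweight, Precise };

// Hypothetical stand-ins for the existing lightweight estimate and for a
// target-specific refinement; both helpers are made up for this sketch.
static unsigned getBaseCost(unsigned Opcode) { return 1; }
static unsigned getMemoryFoldingPenalty(unsigned Opcode) { return 2; }

// A mode-aware cost query: Lightweight keeps today's behavior, while
// Precise may spend extra compile time consulting target detail (such as
// whether the scalar form could fold a memory-destination operand).
unsigned getInstrCost(unsigned Opcode, CostModelPrecision P) {
  unsigned Cost = getBaseCost(Opcode);
  if (P == CostModelPrecision::Precise)
    Cost += getMemoryFoldingPenalty(Opcode);
  return Cost;
}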

David

>
> -eric
>
>>
>> David
>>
>>
>> >
>> >
>> >>
>> >> David
>> >>
>> >>
>> >>> So we have solutions that are going to be imprecise, and the question
>> >>> is how complex we want to make our modeling.
>> >>>
>> >>>
>> >>>>>> 2) "We don’t want TTI to expose an API that will be the superset of
>> >>>>>> all target specific information …”: What belongs to the TTI and what
>> >>>>>> doesn’t? Can’t a target decide for itself?
>> >>>
>> >>> I think what that means is that we don’t want to expose every target’s
>> >>> idiosyncrasies at the LLVM IR level. There might of course be room for certain
>> >>> things. Again, any proposal has to be evaluated individually.
>> >>>
>> >>>>>> 3) " implement a peephole to reverse vectorization if needed, in
>> >>>>>> the x86 backend”: that amounts to “Work. Check if that work should have
>> >>>>>> been done. If not, undo that work and do it better.” That is certainly more
>> >>>>>> expensive than querying a cost function, and it is both harder and a
>> >>>>>> different concept than possibly splitting memory ops late in the pipeline.
>> >>>
>> >>> Frankly, I believe the cost is negligible and not really a concern
>> >>> here.
>> >>>
>> >>> This would probably be a simple ISel pattern family; what is complex
>> >>> about this?
>> >>>
>> >>> (store (shr/… (load <2 x i32> )) addr) -> scalar version ...
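>> >>>
>> >>> Something along these lines, as a hedged sketch using the two-i64 case
>> >>> discussed below (the helper name, the elided legality checks, and the
>> >>> exact API details are illustrative, not actual backend code):
>> >>>
>> >>> #include "llvm/CodeGen/SelectionDAG.h"
>> >>> #include "llvm/CodeGen/SelectionDAGNodes.h"
>> >>> using namespace llvm;
>> >>>
>> >>> // Scalarize (store (srl (load <2 x i64>), splat ShAmt), addr) back into
>> >>> // two scalar shifts, which ISel can then fold into shr $ShAmt, (mem).
>> >>> static SDValue scalarizeShiftedStore(StoreSDNode *St, SelectionDAG &DAG,
>> >>>                                      uint64_t ShAmt) {
>> >>>   SDLoc DL(St);
>> >>>   SDValue Shift = St->getValue();
>> >>>   if (Shift.getOpcode() != ISD::SRL ||
>> >>>       Shift.getValueType() != MVT::v2i64 || !Shift.hasOneUse())
>> >>>     return SDValue();
>> >>>   auto *Ld = dyn_cast<LoadSDNode>(Shift.getOperand(0));
>> >>>   if (!Ld || !Ld->hasOneUse() || Ld->getBasePtr() != St->getBasePtr())
>> >>>     return SDValue();
>> >>>   // Checks that the shift amount really is a constant splat of ShAmt and
>> >>>   // that both memory ops are simple and non-volatile are elided here.
>> >>>   EVT PtrVT = Ld->getBasePtr().getValueType();
>> >>>   SDValue Chain = Ld->getChain();
>> >>>   SDValue Stores[2];
>> >>>   for (unsigned i = 0; i != 2; ++i) {
>> >>>     SDValue Ptr = DAG.getNode(ISD::ADD, DL, PtrVT, Ld->getBasePtr(),
>> >>>                               DAG.getConstant(8 * i, DL, PtrVT));
>> >>>     SDValue Elt = DAG.getLoad(MVT::i64, DL, Chain, Ptr,
>> >>>                               MachinePointerInfo());
>> >>>     SDValue Shr = DAG.getNode(ISD::SRL, DL, MVT::i64, Elt,
>> >>>                               DAG.getConstant(ShAmt, DL, MVT::i64));
>> >>>     Stores[i] = DAG.getStore(Chain, DL, Shr, Ptr, MachinePointerInfo());
>> >>>   }
>> >>>   return DAG.getNode(ISD::TokenFactor, DL, MVT::Other, Stores[0],
>> >>>                      Stores[1]);
>> >>> }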
>> >>>
>> >>> If we can be more precise with reasonable complexity, I am all for it.
>> >>> Replicating ISel should be a non-goal in my opinion.
>> >>>
>> >>> In this concrete instance we might decide to make a shr with a (small)
>> >>> constant more expensive for vectors of two i64 (and maybe other
>> >>> combinations) so as not to pessimize scalar code. Hard to say; I have not
>> >>> done a thorough analysis. Making such a change requires analysis of a set
>> >>> of micro-kernels, the architecture, and so on … all the suggestions that I
>> >>> have made are based on incomplete information.
>> >>>
>> >>>
>> >>> What are you proposing? Only a concrete solution can be evaluated.
>> >>>
>> >>>
>> >>>> On Jun 11, 2015, at 11:35 AM, Gerolf Hoflehner <ghoflehner at apple.com>
>> >>>> wrote:
>> >>>>
>> >>>> When the analysis is off, nothing is to be done. That should be easy
>> >>>> to verify.
>> >>>>
>> >>>> Yes, there are plenty of problems with the patch itself. But all that
>> >>>> comes later and is orthogonal to the big-picture questions I'm after.
>> >>>>
>> >>>> Gerolf
>> >>>>
>> >>>>
>> >>>>> On Jun 10, 2015, at 8:05 PM, Arnold <aschwaighofer at apple.com> wrote:
>> >>>>>
>> >>>>> Are we even sure we have the right explanation of why the scalar
>> >>>>> version is faster, and that we would be modeling the right thing here?
>> >>>>> Sure, a scalar instruction in the Intel ISA can perform two memory
>> >>>>> operations, but when it gets decoded by the processor's front end I
>> >>>>> would expect it to be translated into micro-ops similar to those of the
>> >>>>> vectorized version (other than the obvious difference). One can consult
>> >>>>> Agner's instruction manuals to get these values ...
>> >>>>>
>> >>>>> A quick glance at Haswell:
>> >>>>>
>> >>>>> 4 x shr m,i    (4 unfused uops each)   = 16 uops
>> >>>>>
>> >>>>> vs.
>> >>>>>
>> >>>>> 1 x movdqa x,m (load,  1 uop)  +
>> >>>>> 1 x psrlq      (shift, 2 uops) +
>> >>>>> 1 x movdqa m,x (store, 2 uops)         =  5 uops
>> >>>>>
>> >>>>> Now we would have to look at execution ports, what else is executing,
>> >>>>> etc.
>> >>>>>
>> >>>>> Is the scalar version always faster, or only in this benchmark's
>> >>>>> context? Does the behavior depend on other instructions executing
>> >>>>> concurrently?
>> >>>>>
>> >>>>> Why did we choose a cost of two times the vector length? Would this
>> >>>>> be the right constant on all targets? How does that relate to the actual
>> >>>>> execution on an out-of-order target processor? Is the throughput halved?
>> >>>>>
>> >>>>> We are never going to estimate the cost precisely without lowering
>> >>>>> to machine instructions and simulating the processor (for example, we do
>> >>>>> this in the machine combiner or the machine if-converter, where we
>> >>>>> simulate throughput and critical path somewhat realistically).
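>> >>>>>
>> >>>>> (Illustratively, the flavor of comparison the machine combiner makes is
>> >>>>> something like the following; the struct and helper are made up for this
>> >>>>> sketch:)
>> >>>>>
>> >>>>> // Accept a new instruction sequence only if it neither lengthens the
>> >>>>> // critical path nor consumes more issue resources than the old one.
>> >>>>> struct SeqMetrics {
>> >>>>>   unsigned CriticalPathCycles; // length of the longest dependence chain
>> >>>>>   unsigned ResourceCycles;     // roughly, uop count over issue width
>> >>>>> };
>> >>>>>
>> >>>>> static bool isNewSequenceBetter(SeqMetrics Old, SeqMetrics New) {
>> >>>>>   return New.CriticalPathCycles <= Old.CriticalPathCycles &&
>> >>>>>          New.ResourceCycles <= Old.ResourceCycles;
>> >>>>> }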
>> >>>>>
>> >>>>> So more 'precision' along one dimension at the LLVM IR level might only
>> >>>>> mean a different lie, but at the cost of added complexity. This is a
>> >>>>> trade-off.
>> >>>>>
>> >>>>> I am not convinced that enough evidence has been established that
>> >>>>> the memory operand is the issue here, or that the suggested model is the
>> >>>>> right one.
>> >>>>>
>> >>>>> I also don't think that we want to implement part of target
>> >>>>> lowering's logic in the vectorizer. Sure, this one example does not add
>> >>>>> much, but there would be more ...
>> >>>>>
>> >>>>> Another way of looking at this: if the scalar version is always
>> >>>>> better, should we scalarize irrespective of whether the code came out of
>> >>>>> the vectorizer or was written directly (e.g. with clang vector
>> >>>>> intrinsics)?
>> >>>>>
>> >>>>>
>> >>>>>> On Jun 10, 2015, at 6:13 PM, Gerolf Hoflehner
>> >>>>>> <ghoflehner at apple.com> wrote:
>> >>>>>>
>> >>>>>> Hi Nadav,
>> >>>>>>
>> >>>>>> I’d like to dig into this a bit more. It relates to the question of
>> >>>>>> what the cost model for vectorizers should look like.
>> >>>>>> 1) "the vectorizers should not model things like instruction
>> >>>>>> encodings because it can complicate the vectorizer and TTI significantly”:
>> >>>>>> You point to some HW-specific assumptions, and this is one more. From the
>> >>>>>> code I don’t see that it increases complexity - at least not significantly.
>> >>>>>> And a target may decide that a more expensive cost analysis is worth the
>> >>>>>> effort. The general follow-up questions are: Where is the line between what
>> >>>>>> the cost model should take into account and what it shouldn’t? And when is
>> >>>>>> it “too complex”?
>> >>>>>> 2) "We don’t want TTI to expose an API that will be the superset of
>> >>>>>> all target specific information …”: What belongs to the TTI and what
>> >>>>>> doesn’t? Can’t a target decide for itself?
>> >>>>>> 3) " implement a peephole to reverse vectorization if needed, in
>> >>>>>> the x86 backend”: that amounts to “Work. Check if that work should have
>> >>>>>> been done. If not, undo that work and do it better.” That is certainly more
>> >>>>>> expensive than querying a cost function, and it is both harder and a
>> >>>>>> different concept than possibly splitting memory ops late in the pipeline.
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Gerolf
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>> On Jun 10, 2015, at 9:42 AM, Nadav Rotem <nrotem at apple.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi Wei,
>> >>>>>>>
>> >>>>>>> Thank you for working on this. The SLP and Loop vectorizers (and
>> >>>>>>> the target transform info) can’t model the entire complexity of the
>> >>>>>>> different targets. The vectorizers do not model the cost of addressing modes
>> >>>>>>> and assume that folding memory operations into instructions is only a code
>> >>>>>>> size reduction that does not influence runtime performance because
>> >>>>>>> out-of-order processors translate these instructions into multiple uops that
>> >>>>>>> perform the load/store and the arithmetic operation. This is only an
>> >>>>>>> approximation of the way out-of-order processors work, but we need to make
>> >>>>>>> such assumptions if we want to limit the complexity of the vectorizer. One
>> >>>>>>> exception to this rule is the special handling of vector GEPs and
>> >>>>>>> scatter/gather instructions. Your example demonstrates that the assumption
>> >>>>>>> that we make about how out-of-order processors work is not always true.
>> >>>>>>> However, I still believe that the vectorizers should not model things like
>> >>>>>>> instruction encodings because it can complicate the vectorizer and TTI
>> >>>>>>> significantly. I believe that a better place to make a decision about the
>> >>>>>>> optimal code sequence in your example would be SelectionDAG. The codegen has
>> >>>>>>> more information to make such a decision. We don’t want TTI to expose an API
>> >>>>>>> that will be the superset of all target specific information that any pass
>> >>>>>>> may care about. I suggest that we keep the current vectorizer cost model and
>> >>>>>>> implement a peephole to reverse vectorization if needed, in the x86 backend.
>> >>>>>>> We already do something similar for wide AVX loads. On Sandybridge it is
>> >>>>>>> often beneficial to split 256-bit loads/stores, and the decision to split
>> >>>>>>> such loads is made in the codegen and not inside the vectorizer.
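>> >>>>>>>
>> >>>>>>> For reference, that kind of split looks roughly like this (a hedged
>> >>>>>>> sketch, not the actual X86 lowering code; chain fixup and the
>> >>>>>>> slow-unaligned-memory predicate are elided):
>> >>>>>>>
>> >>>>>>> #include "llvm/CodeGen/SelectionDAG.h"
>> >>>>>>> #include "llvm/CodeGen/SelectionDAGNodes.h"
>> >>>>>>> using namespace llvm;
>> >>>>>>>
>> >>>>>>> // Split a 256-bit load into two 128-bit halves; callers would also
>> >>>>>>> // need to replace the original load's chain and value uses.
>> >>>>>>> static SDValue splitWideLoad(LoadSDNode *Ld, SelectionDAG &DAG) {
>> >>>>>>>   SDLoc DL(Ld);
>> >>>>>>>   EVT PtrVT = Ld->getBasePtr().getValueType();
>> >>>>>>>   SDValue LoPtr = Ld->getBasePtr();
>> >>>>>>>   SDValue HiPtr = DAG.getNode(ISD::ADD, DL, PtrVT, LoPtr,
>> >>>>>>>                               DAG.getConstant(16, DL, PtrVT));
>> >>>>>>>   SDValue Lo = DAG.getLoad(MVT::v2i64, DL, Ld->getChain(), LoPtr,
>> >>>>>>>                            MachinePointerInfo());
>> >>>>>>>   SDValue Hi = DAG.getLoad(MVT::v2i64, DL, Ld->getChain(), HiPtr,
>> >>>>>>>                            MachinePointerInfo());
>> >>>>>>>   // Reassemble the 256-bit value from the two 128-bit halves.
>> >>>>>>>   return DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v4i64, Lo, Hi);
>> >>>>>>> }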
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Nadav
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>> On Jun 9, 2015, at 5:33 PM, Wei Mi <wmi at google.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi nadav, aschwaighofer,
>> >>>>>>>>
>> >>>>>>>> This is the patch to fix the performance problem reported in
>> >>>>>>>> https://llvm.org/bugs/show_bug.cgi?id=23510.
>> >>>>>>>>
>> >>>>>>>> Many X86 scalar instructions support using a memory operand as the
>> >>>>>>>> destination, but most vector instructions do not. In SLP cost
>> >>>>>>>> evaluation:
>> >>>>>>>>
>> >>>>>>>> scalar version:
>> >>>>>>>> t1 = load [mem1];
>> >>>>>>>> t1 = shift 5, t1
>> >>>>>>>> store t1, [mem1]
>> >>>>>>>> ...
>> >>>>>>>> t4 = load [mem4];
>> >>>>>>>> t4 = shift 5, t4
>> >>>>>>>> store t4, [mem4]
>> >>>>>>>>
>> >>>>>>>> slp vectorized version:
>> >>>>>>>> v1 = vload [mem1];
>> >>>>>>>> v1 = vshift 5, v1
>> >>>>>>>> store v1, [mem1]
>> >>>>>>>>
>> >>>>>>>> The SLP cost model thinks there will be a saving of 12 - 3 = 9
>> >>>>>>>> instructions. But the scalar version can be converted to the following
>> >>>>>>>> form on x86 while the vectorized version cannot:
>> >>>>>>>>
>> >>>>>>>> [mem1] = shift 5, [mem1]
>> >>>>>>>> [mem2] = shift 5, [mem2]
>> >>>>>>>> [mem3] = shift 5, [mem3]
>> >>>>>>>> [mem4] = shift 5, [mem4]
>> >>>>>>>>
>> >>>>>>>> We add the extra cost VL * 2 to the SLP cost evaluation to handle
>> >>>>>>>> such cases (VL is the vector length).
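>> >>>>>>>>
>> >>>>>>>> The shape of the adjustment, as a hedged sketch (the real change is in
>> >>>>>>>> the attached patch; this helper and its parameters are illustrative):
>> >>>>>>>>
>> >>>>>>>> // If each scalar lane could have folded its load and store into a
>> >>>>>>>> // memory-destination instruction, charge the vector plan an extra
>> >>>>>>>> // 2 * VL when comparing costs.
>> >>>>>>>> static int vectorizationSavings(int ScalarCost, int VectorCost,
>> >>>>>>>>                                 unsigned VL,
>> >>>>>>>>                                 bool ScalarOpsCanFoldMemDest) {
>> >>>>>>>>   int Savings = ScalarCost - VectorCost; // 12 - 3 = 9 in the example
>> >>>>>>>>   if (ScalarOpsCanFoldMemDest)
>> >>>>>>>>     Savings -= 2 * VL;                   // the extra VL * 2 cost
>> >>>>>>>>   return Savings;                        // 9 - 8 = 1 for VL == 4
>> >>>>>>>> }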
>> >>>>>>>>
>> >>>>>>>> REPOSITORY
>> >>>>>>>> rL LLVM
>> >>>>>>>>
>> >>>>>>>> http://reviews.llvm.org/D10352
>> >>>>>>>>
>> >>>>>>>> Files:
>> >>>>>>>> include/llvm/Analysis/TargetTransformInfo.h
>> >>>>>>>> include/llvm/Analysis/TargetTransformInfoImpl.h
>> >>>>>>>> lib/Analysis/TargetTransformInfo.cpp
>> >>>>>>>> lib/Target/X86/X86TargetTransformInfo.cpp
>> >>>>>>>> lib/Target/X86/X86TargetTransformInfo.h
>> >>>>>>>> lib/Transforms/Vectorize/SLPVectorizer.cpp
>> >>>>>>>> test/Transforms/SLPVectorizer/X86/pr23510.ll
>> >>>>>>>>
>> >>>>>>>> <D10352.27417.patch>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>
>> >
>>



