[PATCH] Make SLP vectorizer consider the cost that vectorized instruction cannot use memory operand as destination on X86

Arnold Schwaighofer aschwaighofer at apple.com
Thu Jun 11 13:41:34 PDT 2015


To me, the big picture is that we are trying to model something at the LLVM IR level with very incomplete information. Whether we can meaningfully model a given problem, and whether modeling it makes sense given that imprecision or whether other solutions make more sense, depends on the problem in question.

In Wei's last email he more or less confirmed my suspicion that this is a case where it is not enough to just look at the pattern itself; it really depends on what else is in the pipeline:

From the target benchmark, for the vectorized version, the uops count
decreased but the count of reservation-station-full events
increased. ***I can create another small testcase where the vectorized
version is better.*** So it may not be proper to simply use instruction
counts to evaluate the cost. I will investigate the microarchitecture
behavior of the target benchmark some more and find a better solution.

In my last email I proposed increasing the cost of certain vector instructions, but even that would probably not be right here, because throughput depends on what else is in the pipeline, and would also potentially depend on whether the value is in a register or not (what if the constant is bigger than what the ISA allows to be held in an immediate?). Such decisions we really want to make in the backend, where we are much better equipped to make that judgment.

Trying to answer your questions:

>>> 1) "the vectorizers should not model things like instruction encodings because it can complicate the vectorizer and TTI significantly”: You point to some HW specific assumptions and this is one more. From the code I don’t see that it increases complexity - at least not significantly.

As I have said before, one piece of logic to peephole a single pattern does not add much complexity. But there will be more ...

>>> And a target may decide that more expensive cost analysis is worth the effort. The general follow-up questions are: Where is the line between what the cost model should take into account and what it shouldn’t? And when is it “too complex”?

If we start implementing ISel-like cost estimation logic in the vectorizer (i.e. answering questions like “can we use two memory operands in one instruction for this tree?”), that is too complex (at least so far). But it turned out that this is not the problem here anyway.

So far we have taken the stance that the cost model should not look at more than one instruction, because it cannot possibly know what instruction selection will match, and if it wanted to it would have to replicate parts of ISel. You can’t make a precise judgment about performance by looking at just one instruction or even a small instruction tree: in many cases where such fine-grained modeling matters, other factors come into play (what else is executing in the pipeline, what the critical path is, etc.). This approach is of course bound to have imprecision.
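To make the imprecision concrete, here is a minimal sketch of the flat per-instruction summation, using the instruction counts from Wei's example further down in this thread. This is purely illustrative Python (the real model is the C++ TTI cost hooks); the cost tables and the `cost` helper are invented for illustration.

```python
# Illustrative sketch of a per-instruction cost model (NOT the real TTI
# code).  Each IR instruction gets a flat cost and the model sums them
# independently, so it cannot see that x86 can fold the load and store
# of the scalar version into one shift-with-memory-operand instruction.

SCALAR_COST = {"load": 1, "shift": 1, "store": 1}
VECTOR_COST = {"vload": 1, "vshift": 1, "vstore": 1}

def cost(insns, table):
    return sum(table[op] for op in insns)

# Wei's example: four scalar load/shift/store triples ...
scalar = ["load", "shift", "store"] * 4
# ... versus one vectorized triple.
vector = ["vload", "vshift", "vstore"]

saving = cost(scalar, SCALAR_COST) - cost(vector, VECTOR_COST)
print(saving)  # 12 - 3 = 9, the saving the SLP cost model reports
```

The model reports a saving of 9 instructions, even though x86 can shrink the scalar side to four memory-destination shifts; that gap is exactly the imprecision under discussion.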

One could try to pattern match some instruction trees, but that means replicating parts of ISel with less precision and more complexity. In many cases we can fix those in ISel instead. So far we have not decided to take this approach. I could imagine a framework where our cost estimation calls a tree matcher first, to filter out certain instruction patterns from the regular per-instruction cost estimation, and that would allow targets to assign costs to whole trees. But in my opinion we would need examples where this is the only way to get a good result to justify doing so. Something like that was not suggested by the patch, though.
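Such a framework could look roughly like this. This is a hypothetical Python sketch: the `tree_cost` driver, the matcher hook, and the tuple encoding of instruction trees are all invented here; nothing like this exists in TTI today.

```python
# Hypothetical sketch of a tree-cost hook (all names invented for
# illustration).  A target registers matchers that can claim a whole
# instruction tree and price it; unclaimed trees fall through to the
# usual flat per-instruction estimate.

def default_cost(op):
    return 1  # flat per-instruction cost

def x86_store_shift_load(tree):
    # (store (shr (load x) imm)) folds into one memory-destination
    # shift on x86, so price the whole tree as a single instruction.
    if tree == ("store", ("shr", ("load",), "imm")):
        return 1
    return None  # not claimed; caller uses per-instruction costs

def tree_cost(tree, matchers):
    for match in matchers:
        c = match(tree)
        if c is not None:
            return c
    # Fall back: flatten the tree and sum per-instruction costs.
    ops = []
    def walk(node):
        if isinstance(node, tuple):
            ops.append(node[0])
            for child in node[1:]:
                walk(child)
    walk(tree)
    return sum(default_cost(op) for op in ops)

scalar_tree = ("store", ("shr", ("load",), "imm"))
print(tree_cost(scalar_tree, [x86_store_shift_load]))  # 1, tree claimed
print(tree_cost(scalar_tree, []))                      # 3 without the matcher
```

Even this toy shows the cost of the idea: the matcher is a little copy of an ISel pattern, which is exactly the duplication argued against above.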

If there is a case where that makes sense we can evaluate it but if there are reasonable ways to work around it and defer to ISel I think that is the better choice in the complexity vs precision tradeoff.

The “best solution” would be to send the vectorized code to the backend and ask whether it is better or worse - we don’t want to do this because it is too expensive in compile time. So our solutions are going to be imprecise, and the question is how complex do we want to make our modeling.


>>> 2) "We don’t want TTI to expose an API that will be the superset of all target specific information …”: What belongs to the TTI and what doesn’t? Can’t  a target decide for itself?

I think what that means is that we don’t want to expose every target’s idiosyncrasies at the LLVM IR level. There might of course be room for certain things. Again, any proposal has to be evaluated individually.

>>> 3) " implement a peephole to reverse vectorization if needed, in the x86 backend”: that amounts to “Work. Check if that work should have been done. If not undo that work and do it better.” and is certainly more expensive than querying a cost function. And this is both harder and a different concept than possibly splitting memory ops late in the pipeline.

Frankly, I believe the cost is negligible and not really a concern here.

This would probably be a simple ISel pattern family; what is complex about it?

(store (shr/… (load <2 x i32> )) addr) -> scalar version ...

I am all for being more precise if we can do it with reasonable complexity. Replicating ISel should be a non-goal, in my opinion.

In this concrete instance we might decide to make a shr with a (small) constant more expensive for vectors of two i64, so as not to pessimize scalar code (and maybe for other combinations). Hard to say; I have not done the thorough analysis. Making such a change requires analyzing a set of micro-kernels, the architecture, and so on … all the suggestions I have made are based on incomplete information.
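As a sketch of that direction (illustrative Python only loosely mirroring the shape of the X86 cost tables; the opcode/type keys and the override value of 3 are invented here, not measured):

```python
# Illustrative sketch of a per-type cost-table override.  The entries
# and values are hypothetical; a real change would live in the X86 TTI
# cost tables and would need benchmarking to pick the right numbers.

DEFAULT_SHIFT_COST = 1

# (opcode, vector type) -> overridden cost
SHIFT_COST_OVERRIDES = {
    # Hypothetical: charge more for a v2i64 shift by immediate, to
    # discourage vectorizing shifts that x86 can fold into
    # memory-destination scalar instructions.
    ("shr_imm", "v2i64"): 3,
}

def shift_cost(opcode, ty):
    return SHIFT_COST_OVERRIDES.get((opcode, ty), DEFAULT_SHIFT_COST)

print(shift_cost("shr_imm", "v2i64"))  # 3 (overridden)
print(shift_cost("shr_imm", "v4i32"))  # 1 (default)
```

The appeal of this shape is that it stays within the existing one-instruction cost model; the risk, as noted above, is picking a constant that is wrong for other pipelines.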


What are you proposing? Only a concrete solution can be evaluated.


> On Jun 11, 2015, at 11:35 AM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
> 
> When the analysis is off nothing is to be done.  That should be easy to verify.
> 
> Yes, there is plenty of problems with the patch itself. But all that comes later and is orthogonal to the big picture questions I'm after.
> 
> Gerolf
> 
> Sent from my iPhone
> 
>> On Jun 10, 2015, at 8:05 PM, Arnold <aschwaighofer at apple.com> wrote:
>> 
>> Are we even sure we have the right explanation of why the scalar version is faster, and that we would be modeling the right thing here? Sure, the Intel ISA lets one scalar instruction perform two memory ops, but when that gets translated by the processor's front end I would expect it to turn into micro-ops similar to the vectorized version (other than the obvious difference). One can consult Agner's instruction manuals to get these values ...
>> 
>> A quick glance at Haswell:
>> 
>> 4 x shr m,i: 4 unfused uops each = 16 uops
>> 
>> vs.
>> 
>> 1 x movdqa m,x: 2 uops +
>> 1 x psrlq:      2 uops +
>> 1 x movdqa x,m: 1 uop  = 5 uops
>> 
>> Now we would have to look at execution ports, what else is executing, etc.
>> 
>> Is the scalar version always faster, or only in this benchmark's context? Does the behavior depend on other instructions executing concurrently?
>> 
>> Why did we choose a cost of two times the vector length? Would this be the right constant on all targets? How does that relate to actual execution on an out-of-order target processor? Is the throughput halved?
>> 
>> We are never going to estimate the cost precisely without lowering to machine instructions and simulating the processor (for example, we do this in the machine combiner and the machine if-converter, where we simulate throughput and the critical path somewhat realistically).
>> 
>> So more 'precision' along one dimension at the LLVM IR level might only mean a different lie, at the cost of added complexity. This is a trade-off.
>> 
>> I am not convinced that enough evidence has been established that the memory operand is the issue here and the model suggested is the right one.
>> 
>> I also don't think that we want to implement part of target lowering's logic in the vectorizer. Sure this one example does not add much but there would be more ...
>> 
>> Another way of looking at this: if the scalar version is always better, should we scalarize irrespective of whether that code came out of the vectorizer or was written some other way (clang vector intrinsics)?
>> 
>> Sent from my iPhone
>> 
>>> On Jun 10, 2015, at 6:13 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>>> 
>>> Hi Nadav,
>>> 
>>> I’d like to dig into this a bit more. It relates to the question of what the cost model for vectorizers should look like.
>>> 1) "the vectorizers should not model things like instruction encodings because it can complicate the vectorizer and TTI significantly”: You point to some HW specific assumptions and this is one more. From the code I don’t see that it increases complexity - at least not significantly. And a target may decide that more expensive cost analysis is worth the effort. The general follow up questions are: Where is the line what the cost model should take into and what it shouldn’t? And when is it “too complex”?
>>> 2) "We don’t want TTI to expose an API that will be the superset of all target specific information …”: What belongs to the TTI and what doesn’t? Can’t  a target decide for itself?
>>> 3) " implement a peephole to reverse vectorization if needed, in the x86 backend”: that amounts to “Work. Check if that work should have been done. If not undo that work and do it better.” and is certainly more expensive than querying a cost function. And this is both harder and a different concept than possibly splitting memory ops late in the pipeline.
>>> 
>>> Thanks
>>> Gerolf
>>> 
>>> 
>>> 
>>>> On Jun 10, 2015, at 9:42 AM, Nadav Rotem <nrotem at apple.com> wrote:
>>>> 
>>>> Hi Wei, 
>>>> 
>>>> Thank you for working on this. The SLP and Loop vectorizers (and the target transform info) can’t model the entire complexity of the different targets. The vectorizers do not model the cost of addressing modes, and assume that folding memory operations into instructions is only a code-size reduction that does not influence runtime performance, because out-of-order processors translate these instructions into multiple uops that perform the load/store and the arithmetic operation. This is only an approximation of the way out-of-order processors work, but we need to make such assumptions if we want to limit the complexity of the vectorizer. One exception to this rule is the special handling of vector geps, and scatter/gather instructions.
>>>> 
>>>> Your example demonstrates that the assumption that we make about how out-of-order processors work is not always true. However, I still believe that the vectorizers should not model things like instruction encodings because it can complicate the vectorizer and TTI significantly. I believe that a better place to make a decision about the optimal code sequence in your example would be SelectionDAG. The codegen has more information to make such a decision. We don’t want TTI to expose an API that will be the superset of all target specific information that any pass may care about.
>>>> 
>>>> I suggest that we keep the current vectorizer cost model and implement a peephole to reverse vectorization, if needed, in the x86 backend. We already do something similar for wide AVX loads: on Sandybridge it is often beneficial to split 256-bit loads/stores, and the decision to split such loads is made in the codegen and not inside the vectorizer.
>>>> 
>>>> Thanks,
>>>> Nadav 
>>>> 
>>>> 
>>>>> On Jun 9, 2015, at 5:33 PM, Wei Mi <wmi at google.com> wrote:
>>>>> 
>>>>> Hi nadav, aschwaighofer,
>>>>> 
>>>>> This is the patch to fix the performance problem reported in https://llvm.org/bugs/show_bug.cgi?id=23510.
>>>>> 
>>>>> Many X86 scalar instructions support using a memory operand as the destination, but most vector instructions do not. In the SLP cost evaluation,
>>>>> 
>>>>> scalar version:
>>>>> t1 = load [mem1]
>>>>> t1 = shift 5, t1
>>>>> store t1, [mem1]
>>>>> ...
>>>>> t4 = load [mem4]
>>>>> t4 = shift 5, t4
>>>>> store t4, [mem4]
>>>>> 
>>>>> slp vectorized version:
>>>>> v1 = vload [mem];
>>>>> v1 = vshift 5, v1
>>>>> store v1, [mem]
>>>>> 
>>>>> The SLP cost model thinks there will be a savings of 12 - 3 = 9 instructions. But the scalar version can be converted to the following form on x86, while the vectorized version cannot:
>>>>> 
>>>>> [mem1] = shift 5, [mem1]
>>>>> [mem2] = shift 5, [mem2]
>>>>> [mem3] = shift 5, [mem3]
>>>>> [mem4] = shift 5, [mem4]
>>>>> 
>>>>> We add the extra cost VL * 2 to the SLP cost evaluation to handle such cases (VL is the vector length).
>>>>> 
>>>>> REPOSITORY
>>>>> rL LLVM
>>>>> 
>>>>> http://reviews.llvm.org/D10352
>>>>> 
>>>>> Files:
>>>>> include/llvm/Analysis/TargetTransformInfo.h
>>>>> include/llvm/Analysis/TargetTransformInfoImpl.h
>>>>> lib/Analysis/TargetTransformInfo.cpp
>>>>> lib/Target/X86/X86TargetTransformInfo.cpp
>>>>> lib/Target/X86/X86TargetTransformInfo.h
>>>>> lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>>> test/Transforms/SLPVectorizer/X86/pr23510.ll
>>>>> 
>>>>> EMAIL PREFERENCES
>>>>> http://reviews.llvm.org/settings/panel/emailpreferences/
>>>>> <D10352.27417.patch>
>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>>> 




