[PATCH] Make SLP vectorizer consider the cost that vectorized instruction cannot use memory operand as destination on X86

Wed Jun 10 11:01:13 PDT 2015

Hi Nadav,

Thank you for your detailed explanation. As you said, modeling
architecture details adds a lot of complexity. I will look into how to
do this in codegen.

Wei.

On Wed, Jun 10, 2015 at 9:42 AM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Wei,
>
> Thank you for working on this. The SLP and loop vectorizers (and the target transform info) can’t model the entire complexity of the different targets. The vectorizers do not model the cost of addressing modes, and they assume that folding memory operations into instructions is only a code size reduction that does not influence runtime performance, because out-of-order processors translate these instructions into multiple uops that perform the load/store and the arithmetic operation. This is only an approximation of the way out-of-order processors work, but we need to make such assumptions if we want to limit the complexity of the vectorizer. One exception to this rule is the special handling of vector GEPs and scatter/gather instructions.
>
> Your example demonstrates that the assumption we make about how out-of-order processors work is not always true. However, I still believe that the vectorizers should not model things like instruction encodings, because doing so can complicate the vectorizer and TTI significantly. I believe that a better place to decide on the optimal code sequence in your example would be SelectionDAG. The codegen has more information to make such a decision. We don’t want TTI to expose an API that is the superset of all target-specific information that any pass may care about.
>
> I suggest that we keep the current vectorizer cost model and implement a peephole in the x86 backend to reverse vectorization if needed. We already do something similar for wide AVX loads: on Sandy Bridge it is often beneficial to split 256-bit loads/stores, and the decision to split them is made in the codegen, not inside the vectorizer.
>
> Thanks,
> Nadav
>
>
>> On Jun 9, 2015, at 5:33 PM, Wei Mi <wmi at google.com> wrote:
>>
>> Hi nadav, aschwaighofer,
>>
>> This is the patch to fix the performance problem reported in https://llvm.org/bugs/show_bug.cgi?id=23510.
>>
>> Many X86 scalar instructions can use a memory operand as the destination, but most vector instructions cannot. Consider the following case in SLP cost evaluation:
>>
>> Scalar version:
>> t1 = load [mem1]
>> t1 = shift 5, t1
>> store t1, [mem1]
>> ...
>> t4 = load [mem4]
>> t4 = shift 5, t4
>> store t4, [mem4]
>>
>> SLP-vectorized version:
>> v1 = vload [mem1]
>> v1 = vshift 5, v1
>> vstore v1, [mem1]
>>
>> The SLP cost model estimates a saving of 12 - 3 = 9 instructions. However, on x86 the scalar version can be converted to the following form, while the vectorized version cannot:
>>
>> [mem1] = shift 5, [mem1]
>> [mem2] = shift 5, [mem2]
>> [mem3] = shift 5, [mem3]
>> [mem4] = shift 5, [mem4]
>>
>> We add an extra cost of VL * 2 to the SLP cost evaluation to handle this case (VL is the vector length).
>>
>> REPOSITORY
>> rL LLVM
>>
>> http://reviews.llvm.org/D10352
>>
>> Files:
>> include/llvm/Analysis/TargetTransformInfo.h
>> include/llvm/Analysis/TargetTransformInfoImpl.h
>> lib/Analysis/TargetTransformInfo.cpp
>> lib/Target/X86/X86TargetTransformInfo.cpp
>> lib/Target/X86/X86TargetTransformInfo.h
>> lib/Transforms/Vectorize/SLPVectorizer.cpp
>> test/Transforms/SLPVectorizer/X86/pr23510.ll
>>
>> <D10352.27417.patch>
>
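
The instruction counts in Wei's original message can be checked with a small sketch. This is only a toy model of the arithmetic in the example, not LLVM code: the function names and constants are hypothetical, and treating the VL * 2 penalty as a reduction of the estimated savings is one reading of the proposal (each folded scalar lane no longer needs its separate load and store).

```python
# Toy model of the instruction counting in the example above.
# All names here are hypothetical; none of them are LLVM APIs.

VL = 4               # vector length: number of scalar lanes packed together
SCALAR_PER_LANE = 3  # load + shift + store, as the cost model sees each lane
VECTOR_INSNS = 3     # vload + vshift + vstore

def naive_savings(vl):
    """Savings the cost model estimates when it ignores memory-destination folding."""
    return vl * SCALAR_PER_LANE - VECTOR_INSNS

def folded_scalar_cost(vl):
    """Scalar cost on x86 when each lane folds into one read-modify-write instruction."""
    return vl

def adjusted_savings(vl):
    """Savings after charging the proposed extra cost of VL * 2 (assumed interpretation)."""
    return naive_savings(vl) - vl * 2

print(naive_savings(VL))                      # 12 - 3
print(folded_scalar_cost(VL) - VECTOR_INSNS)  # folded scalar code vs. vector code
print(adjusted_savings(VL))                   # estimate after the VL * 2 penalty
```

Under these assumptions the adjusted estimate equals the difference between the folded scalar code and the vector code for any VL, which is why a penalty of two instructions per lane fits this particular example.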