[LLVMdev] Limit loop vectorizer to SSE
Frank Winter
fwinter at jlab.org
Fri Nov 15 17:47:29 PST 2013
OK, it seems that starting with Core i7, the unaligned
packed move is as fast as the aligned version and
should be used throughout.
Thanks,
Frank
On 15/11/13 20:34, Arnold Schwaighofer wrote:
> The vectorizer will now emit
>
> = load <8 x i32>, align #TargetAlignmentOfScalari32
>
> where before it would emit
>
> = load <8 x i32>
>
> (which has the semantics of “= load <8 x i32>, align 0”, which means the address has the target ABI alignment for the type; see http://llvm.org/docs/LangRef.html#load-instruction).
>
> When the backend generates code for the former it will emit an unaligned move:
> = vmovups ...
> whereas for the latter it will use an aligned move:
> = vmovaps …
>
> vmovups can load from unaligned addresses, while vmovaps cannot.
>
> No, we currently don’t peel loops for alignment.
>
> Best,
> Arnold
>
> On Nov 15, 2013, at 7:23 PM, Frank Winter <fwinter at jlab.org> wrote:
>
>> I confirm that r194876 fixes the issue, i.e. the segfault no longer occurs.
>>
>> My program still passes 16-byte-aligned pointers to the function,
>> which the loop vectorizer processes successfully:
>>
>> LV: Vector loop of width 8 costs: 1.
>> LV: Selecting VF = : 8.
>> LV: Found a vectorizable loop (8) in func_orig.ll
>> LV: Unroll Factor is 1
>>
>> Since the program runs fine, it seems the CPU is allowed to issue
>> a vector load (8 floats) from a 16-byte-aligned address (as opposed
>> to a 32-byte-aligned one). Or does the loop vectorizer in fact
>> handle this case in the preamble, so that the vector.body issues
>> only 32-byte-aligned accesses? In that case I would align the
>> payload to 32 bytes to save a little work in the preamble.
>>
>> Frank
>>
>>
>>
>> On 15/11/13 18:15, Arnold Schwaighofer wrote:
>>> A fix for this is in r194876.
>>>
>>> Thanks for reporting this!
>>>
>>>
>>> On Nov 15, 2013, at 3:49 PM, Joshua Klontz <josh.klontz at gmail.com> wrote:
>>>
>>>> Nadav,
>>>>
>>>> I believe aligned accesses to unaligned pointers are precisely the issue. Consider the function `add_u8S` before[1] and after[2] the loop vectorizer pass. There is no alignment assumption associated with %kernel_data prior to vectorization. I can't tell if it's the loop vectorizer or the codegen at fault, but the alignment assumption seems to sneak in somewhere.
>>>>
>>>> v/r,
>>>> Josh
>>>>
>>>> [1] http://pastebin.com/kc95WtGG
>>>> [2] http://pastebin.com/VY3ZLVJK
>>>>
>>>>
>>>> On Fri, Nov 15, 2013 at 3:58 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>>
>>>> On Nov 15, 2013, at 12:36 PM, Renato Golin <renato.golin at linaro.org> wrote:
>>>>
>>>>> On 15 November 2013 20:24, Joshua Klontz <josh.klontz at gmail.com> wrote:
>>>>> Agreed, is there a pass that will insert a runtime alignment check? Also, what's the easiest way to get at TargetTransformInfo::getRegisterBitWidth() so I don't have to hard code 32? Thanks!
>>>>>
>>>>> I think that's a fair question, and it's about safety. If you're getting this on the JIT, it means we may be generating unsafe transformations in the vectorizer.
>>>>>
>>>>> Arnold, Nadav, I don't remember seeing code to generate any run-time alignment checks on the incoming pointer, is there such a thing? If not, shouldn't we add one?
>>>> If the vectorizer generates aligned memory accesses to unaligned addresses, then this is a serious bug. But I don’t think that Josh said the vectorizer generated aligned accesses to unaligned pointers.
>>>>
>>>> There is no point in LLVM checking for alignment, because if the memory is unaligned then the program will crash. Users who want to crash with a readable error message can simply write code that checks the pointer (by masking the low bits and comparing to zero).
>>>>
>>>>
>>
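
[Editorial note: Arnold states above that the vectorizer does not peel loops
for alignment, and Frank asks what such a preamble would do. A hand-written
sketch of the idea (not LLVM's actual codegen; the function `add_one` is
hypothetical): run scalar iterations until the pointer reaches a 32-byte
boundary, after which a main loop over 8 floats at a time could use aligned
256-bit accesses.]

```c
#include <stddef.h>
#include <stdint.h>

/* Peel scalar iterations until dst + i hits a 32-byte boundary,
   so the remaining loop could in principle use aligned 8-float
   vector loads/stores. Assumes dst is at least float-aligned. */
void add_one(float *dst, size_t n) {
    size_t i = 0;
    /* Scalar prologue: at most 7 iterations for 4-byte floats. */
    while (i < n && ((uintptr_t)(dst + i) & 31) != 0)
        dst[i++] += 1.0f;
    /* Main body: dst + i is now 32-byte aligned (or i == n). */
    for (; i < n; ++i)
        dst[i] += 1.0f;
}
```

Whether peeling pays off depends on the target; as noted at the top of the
thread, on Core i7 and later the unaligned move is essentially as fast as
the aligned one, so the prologue buys little there.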
More information about the llvm-dev
mailing list