<div dir="ltr">OK, is there any quick workaround to limit vectorization to 16-byte-aligned 128-bit data, then?<div><br></div><div>All the memory copying done by <span style="font-family:arial,sans-serif;font-size:13px">ExpandUnalignedStore/</span><span style="font-family:arial,sans-serif;font-size:13px">ExpandUnalignedLoad is just too expensive.</span></div>

</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Jul 20, 2013 at 12:52 PM, Arnold Schwaighofer <span dir="ltr"><<a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

On Jul 19, 2013, at 3:14 PM, Francois Pichet <<a href="mailto:pichet2000@gmail.com">pichet2000@gmail.com</a>> wrote:<br>

<br>

><br>

> What is the proper solution to disable auto-vectorization for unaligned data?<br>

><br>

> I have an out of tree target and I added this:<br>

><br>

> bool OpusTargetLowering::allowsUnalignedMemoryAccesses(EVT VT, bool *Fast) const {<br>

>   if (VT.isVector())<br>

>     return false;<br>

> ....<br>

> }<br>

><br>

> After that, I could see that vectorization is still done on unaligned data, except that LLVM copies the data back and forth between the source and the top of the stack and works from there. This is very costly; I would rather get scalar operations.<br>


><br>

> Then I tried to add:<br>

>   unsigned getMemoryOpCost(unsigned Opcode, Type *Src,<br>

>                            unsigned Alignment,<br>

>                            unsigned AddressSpace) const {<br>

>     if (Src->isVectorTy() && Alignment != 16)<br>

>       return 10000; // <== high number to try to avoid unaligned load/store.<br>

>     return TargetTransformInfo::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);<br>

>   }<br>

><br>

> Except that this doesn't work because Alignment will always be 4 even for data like:<br>

>        int   data[16][16] __attribute__ ((aligned (16))),<br>

><br>

> Because individual elements are still 4-byte aligned.<br>

<br>

</div></div>We will have to hook up some logic in the loop vectorizer that computes the alignment of the vectorized version of the memory access so that we can pass it to "getMemoryOpCost". Currently, as you have observed, we just pass the scalar loop’s memory access alignment, which is pessimistic.<br>


<br>

InstCombine will later replace the alignment with a stronger one for the vectorized code, but that is obviously too late for the cost model in the vectorizer.</blockquote></div><br></div>
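<div>For illustration, here is a minimal sketch of the kind of alignment computation the reply describes, assuming the array base alignment, the byte offset of the first access, and the byte step per vector iteration are all known. The names minAlign and vectorAccessAlign are hypothetical helpers invented for this example; they are not LLVM APIs, and this is not how the loop vectorizer is actually implemented.</div>

```cpp
// Hypothetical helpers for illustration only; not LLVM APIs.

// Largest power of two that divides both A and B (a zero operand imposes no
// constraint). Computed as the lowest set bit of (A | B).
constexpr unsigned long long minAlign(unsigned long long A,
                                      unsigned long long B) {
  return (A | B) & (~(A | B) + 1);
}

// Alignment guaranteed for every access in a loop, assuming the accesses
// start at Base + StartOffsetBytes and advance by StepBytes per iteration,
// where Base is BaseAlign-aligned and BaseAlign is a power of two.
constexpr unsigned long long vectorAccessAlign(unsigned long long BaseAlign,
                                               unsigned long long StartOffsetBytes,
                                               unsigned long long StepBytes) {
  return minAlign(minAlign(BaseAlign, StartOffsetBytes), StepBytes);
}

// int data[16][16] aligned to 16, four ints per vector, starting at data[0][0]:
static_assert(vectorAccessAlign(16, 0, 16) == 16, "aligned vector accesses");
// Same loop starting at element 1 (byte offset 4): only 4-byte alignment.
static_assert(vectorAccessAlign(16, 4, 16) == 4, "unaligned start");
// Scalar view of the same loop (step of one int): the 4 that the cost hook sees.
static_assert(vectorAccessAlign(16, 0, 4) == 4, "scalar access alignment");
```

<div>Under these assumptions, a cost hook that received the vector-access alignment (16 in the first case) instead of the scalar one (4) could tell aligned and unaligned vector accesses apart.</div>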