<div dir="ltr">On 14 October 2013 18:15, Nadav Rotem <span dir="ltr">&lt;<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>&gt;</span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word"><div>1. We have 4 stores to consecutive locations, but the last element is the constant zero, and not an additional SUB.   At the moment we don’t have support for idempotence operations, but this is something that we should add. <br>

</div></div></blockquote><div><br></div><div>The fourth write is not necessary for GCC to vectorize the code (nor was it in the original code). It is a result of CReduce's attempt to converge while running ARM's GCC and checking for the right sequence of vector instructions. (By the way, CReduce is great!)</div>

<div><br></div><div>In this case, shouldn't the vectorizer just put an undef in the fourth lane? Would back-ends recognize the result as an AVX/NEON/AltiVec instruction, or just try to re-linearise it?</div><div><br></div>
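<div>To make the question concrete, here is a rough sketch of the "pad the fourth lane" idea in C, using the GCC/Clang vector extension. This is only an illustration under assumptions: the function name <code>sub3_padded</code> is made up, C has no way to spell undef, so a zero stands in for the don't-care lane, and nothing here claims to be what either vectorizer actually emits.</div>
<pre>
```c
/* Hypothetical illustration: three subtractions done in a 4-lane vector.
 * Uses the GCC/Clang vector_size extension (not standard C). */
typedef int v4si __attribute__((vector_size(16)));

void sub3_padded(int *out, const int *a, const int *b)
{
    /* Lane 3 is a don't-care. C cannot express undef, so 0 stands in. */
    v4si va = { a[0], a[1], a[2], 0 };
    v4si vb = { b[0], b[1], b[2], 0 };

    /* One 4-wide subtract covers the three useful lanes. */
    v4si vr = va - vb;

    /* Only the three meaningful lanes are written back;
     * out[3] (if it exists) is left untouched. */
    out[0] = vr[0];
    out[1] = vr[1];
    out[2] = vr[2];
}
```
</pre>
<div>Whether a back-end keeps this as a single vector subtract or splits it back into scalars is exactly the open question above.</div>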

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div></div><div>2. The values that we are subtracting come from 3 loads.  We usually load 4 elements from memory, or scalarize the inputs (we don’t support masked loads on AVX512).  <br>

</div></div></blockquote><div><br></div><div>That is a more complicated issue, but we can get away with it if, in a first implementation, we only allow the same number of reads and writes in each loop. In that case, if the operations on the independent variables are identical, then the loop can be simplified by multiplying the induction range by N and reducing the number of load/sub/store lanes to one, at which point loop vectorization becomes trivial.</div>
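<div><br></div><div>The simplification described above can be sketched in C as follows. This is a minimal illustration, assuming three identical sub/store lanes over contiguous <code>int</code> arrays; the function names <code>sub3_lanes</code> and <code>sub_flat</code> are hypothetical and the original test case is not reproduced here.</div>
<pre>
```c
/* Hypothetical "before" shape: three identical load/sub/store lanes
 * per iteration. Assumes n is non-negative and the arrays hold
 * 3 * n elements each. */
void sub3_lanes(int *out, const int *a, const int *b, int n)
{
    for (int i = 0; i != n; ++i) {
        out[3 * i + 0] = a[3 * i + 0] - b[3 * i + 0];
        out[3 * i + 1] = a[3 * i + 1] - b[3 * i + 1];
        out[3 * i + 2] = a[3 * i + 2] - b[3 * i + 2];
    }
}

/* "After" shape: the induction range is multiplied by the lane
 * count (3) and the body is reduced to a single lane. This is the
 * plain element-wise loop a loop vectorizer handles easily. */
void sub_flat(int *out, const int *a, const int *b, int n)
{
    for (int j = 0; j != 3 * n; ++j)
        out[j] = a[j] - b[j];
}
```
</pre>
<div>Both functions write the same values to <code>out</code>; the rewrite is only valid because every lane performs the same operation on consecutive elements.</div>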

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div></div><div>Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop Vectorizer ?<br>

</div></div></blockquote><div></div></div><br></div><div class="gmail_extra">Good question. Which vectorizer does "-ftree-vectorize" turn on? I ask because if I use "-fno-tree-vectorize", the code remains scalar.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">cheers,</div><div class="gmail_extra">--renato</div></div>