<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi Nadav,<br>
<br>
that's the whole point of it. In general I can't make the index
calculation any simpler. The example given is the simplest non-trivial
index function that is needed. It may well be so simple that the
index calculation in this case can be thrown away
altogether and, as you say, be replaced by the simple loop you
mentioned. However, this cannot be done in the general case. In
the general case the index calculation requires the 'rem' and
'div' instructions. The OR instruction must be the result of an
arithmetic transformation by one of the previous passes (due to
-O3?).<br>
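<br>
For reference, a minimal sketch of why an OR can stand in for the addition in the source expression. The function names below are made up for illustration, and which pass performs the rewrite is not established here. For unsigned i, i % 4 occupies only the low two bits, while 8 * (i / 4) has its low three bits cleared, so the two terms never overlap and '+' and '|' give the same result.<br>
<pre>

```cpp
#include <cassert>
#include <cstdint>

// Index as written in the source: remainder plus scaled quotient.
constexpr std::uint64_t index_divrem(std::uint64_t i) {
    return i % 4 + 8 * (i / 4);
}

// Index in the shape seen in the optimized IR: and, lshr, shl, or.
constexpr std::uint64_t index_bitwise(std::uint64_t i) {
    const std::uint64_t low  = i & 3;          // same as i % 4 for unsigned i
    const std::uint64_t high = (i >> 2) << 3;  // same as 8 * (i / 4); low three bits are zero
    return high | low;                         // no overlapping bits, so '|' equals '+'
}

// Hypothetical helper: assert that both forms agree for every i below 'limit'.
inline void check_forms_agree(std::uint64_t limit) {
    for (std::uint64_t i = 0; i < limit; ++i) {
        assert(index_divrem(i) == index_bitwise(i));
    }
}
```

</pre>
Calling check_forms_agree(1024) runs through without tripping the assertion, which is consistent with the and/lshr/shl/or sequence being a rewrite of the rem/div form.<br>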
<br>
I don't see a way around making the vectorizers recognize such
arithmetic operations.<br>
<br>
Do you think this can be done relatively quickly or does this
involve a huge effort?<br>
<br>
What does 'stepping through' the loop vectorizer mean? Using the
debugger to step through the program? Probably not. Is the way to
go to alter the 'canVectorize' function so that it prints debug
output showing what's going on?<br>
<br>
Hal, you seem to know the loop vectorizer. Is this the right place to
look, or is the SLP vectorizer the more promising one?<br>
<br>
Frank<br>
<br>
<br>
On 31/10/13 12:50, Nadav Rotem wrote:<br>
</div>
<blockquote
cite="mid:916FD825-98E5-4E99-9D1F-5FB46F0CEEF0@apple.com"
type="cite">
<div>Hi Frank, </div>
<div><br>
</div>
<div>This loop should be vectorized by the SLP-vectorizer. It has
several scalars (C[0], C[1] … ) that can be merged into a
vector. The SLP vectorizer can’t figure out that the stores are
consecutive because SCEV can’t analyze the OR in the index
calculation:</div>
<div> </div>
<div> %2 = and i64 %i.04, 3<br>
%3 = lshr i64 %i.04, 2<br>
%4 = shl i64 %3, 3<br>
%5 = or i64 %4, %2<br>
%11 = getelementptr inbounds float* %c, i64 %5<br>
store float %10, float* %11, align 4, !tbaa !0<br>
<br>
</div>
<div>You wrote that you want each iteration to look like this:</div>
<div><br>
</div>
<div>
<blockquote type="cite">
<div bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><tt><br>
</tt><tt>i 0: 0 1 2 3 </tt><tt><br>
</tt><tt>i 4: 8 9 10 11 </tt><tt><br>
</tt><tt>i 8: 16 17 18 19 </tt><tt><br>
</tt><tt>i 12: 24 25 26 27<br>
</tt></div>
</div>
</blockquote>
</div>
<div><br>
</div>
<div>Why can’t you just write a small loop like this: for (i=0;
i<4; i++) C[i] = A[i] + B[i] ?? Either the unroller will
unroll it and the SLP-vectorizer will vectorize the unrolled
iterations, or the loop-vectorizer would catch it. </div>
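<div><br>
</div>
<div>A minimal sketch of how the small-loop idea might be applied to the full function, under the assumption (not stated in the thread) that start and end are always multiples of 4. The name bar_blocked is hypothetical. The outer loop walks blocks of 8 floats and the inner loop touches the first 4 floats of each block, so the addresses are a plain affine function of the two loop counters and no 'rem', 'div' or OR is left in the index calculation. Whether a given LLVM version then vectorizes it has not been checked here.</div>
<pre>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rewrite of bar(): same accesses as the original when
// start and end are multiples of 4 (an assumption, enforced below).
void bar_blocked(std::uint64_t start, std::uint64_t end,
                 float* __restrict__ c, const float* __restrict__ a,
                 const float* __restrict__ b)
{
    assert(start % 4 == 0 && end % 4 == 0);
    for (std::uint64_t blk = start / 4; blk < end / 4; ++blk) {
        const std::uint64_t base = 8 * blk;  // first element of this block
        for (std::uint64_t lane = 0; lane < 4; ++lane) {
            c[base + lane] = a[base + lane] + b[base + lane];
        }
    }
}
```

</pre>
<div>For start = 0 and end = 16 this touches elements 0-3, 8-11, 16-19 and 24-27, matching the access table quoted above.</div>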
<div><br>
</div>
<div>Thanks,</div>
<div>Nadav</div>
<br>
<div>
<div>On Oct 31, 2013, at 8:01 AM, Frank Winter <<a
moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite">
<div bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><tt>A small but
complete example function that all vectorization passes
fail to optimize:</tt><tt><br>
</tt><tt><br>
</tt><tt>#include <cstdint></tt><tt><br>
</tt><tt>#include <iostream></tt><tt><br>
</tt><tt><br>
</tt><tt>void bar(std::uint64_t start, std::uint64_t end,
float * __restrict__ c, float * __restrict__ a, float *
__restrict__ b)</tt><tt><br>
</tt><tt>{</tt><tt><br>
</tt><tt> for ( std::uint64_t i = start ; i < end ; i
+= 4 ) {</tt><tt><br>
</tt><tt> {</tt><tt><br>
</tt><tt> const std::uint64_t ir0 = (i+0)%4 +
8*((i+0)/4);</tt><tt><br>
</tt><tt> c[ ir0 ] = a[ ir0 ] + b[
ir0 ];</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> {</tt><tt><br>
</tt><tt> const std::uint64_t ir0 = (i+1)%4 +
8*((i+1)/4);</tt><tt><br>
</tt><tt> c[ ir0 ] = a[ ir0 ] + b[
ir0 ];</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> {</tt><tt><br>
</tt><tt> const std::uint64_t ir0 = (i+2)%4 +
8*((i+2)/4);</tt><tt><br>
</tt><tt> c[ ir0 ] = a[ ir0 ] + b[
ir0 ];</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> {</tt><tt><br>
</tt><tt> const std::uint64_t ir0 = (i+3)%4 +
8*((i+3)/4);</tt><tt><br>
</tt><tt> c[ ir0 ] = a[ ir0 ] + b[
ir0 ];</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt> }</tt><tt><br>
</tt><tt>} </tt><tt><br>
</tt><tt><br>
</tt><tt>The loop index and array accesses for the first 4
iterations:</tt><tt><br>
</tt><tt><br>
</tt><tt>i 0: 0 1 2 3 </tt><tt><br>
</tt><tt>i 4: 8 9 10 11 </tt><tt><br>
</tt><tt>i 8: 16 17 18 19 </tt><tt><br>
</tt><tt>i 12: 24 25 26 27<br>
</tt><br>
<pre class="bz_comment_text" id="comment_text_1">For example, on an x86 processor with SSE (128-bit SIMD vectors) the loop body could be vectorized into 2 SIMD loads, 1 SIMD add and 1 SIMD store.
With current trunk I tried the following on the above example:
clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll
All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.
How can the SCEV AA pass be extended to handle this type of arithmetic?
</pre>
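A hand-written sketch of the "2 SIMD loads, 1 SIMD add, 1 SIMD store" loop body described above, using SSE intrinsics from xmmintrin.h. The name bar_sse is hypothetical, and the code assumes start is a multiple of 4 so that the four indices of one iteration are consecutive. Unaligned loads and stores are used because nothing is known about the alignment of the arrays.<br>
<pre>

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics, x86 only

// Hypothetical hand-vectorized variant of bar(); assumes start % 4 == 0.
void bar_sse(std::uint64_t start, std::uint64_t end,
             float* __restrict__ c, const float* __restrict__ a,
             const float* __restrict__ b)
{
    assert(start % 4 == 0);
    for (std::uint64_t i = start; i < end; i += 4) {
        // With i a multiple of 4, (i+k)%4 + 8*((i+k)/4) == 8*(i/4) + k for k = 0..3.
        const std::uint64_t base = 8 * (i / 4);
        const __m128 va = _mm_loadu_ps(a + base);     // SIMD load 1
        const __m128 vb = _mm_loadu_ps(b + base);     // SIMD load 2
        _mm_storeu_ps(c + base, _mm_add_ps(va, vb));  // SIMD add and SIMD store
    }
}
```

</pre>
This is only meant to show the code shape a vectorizer would ideally produce for the example, not what any of the passes currently emits.<br>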
<tt>Frank<br>
</tt><br>
<pre class="bz_comment_text" id="comment_text_1">On 31/10/13 02:21, Renato Golin wrote:
</pre>
</div>
<blockquote
cite="mid:CAMSE1ke-_kjwTfWop3h86C133o=6QAHivCzAUYkwd6kuzqq5Sw@mail.gmail.com"
type="cite">
<div dir="ltr">On 30 October 2013 18:40, Frank Winter <<a
moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>>
wrote:<br>
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote">
<div bgcolor="#FFFFFF" text="#000000">
<div> const std::uint64_t ir0 = (i+0)%4;
// not working<br>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>I thought this would be the case when I saw the
original expression. Maybe we need to teach modulo
arithmetic to SCEV?</div>
</div>
<br>
</div>
<div class="gmail_extra">--renato</div>
</div>
</blockquote>
<br>
<br>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
<br>
</body>
</html>