<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">I tried the following on the
hand-unrolled loop:<br>
<br>
const std::uint64_t ir0 = i*8+0; // working<br>
<br>
const std::uint64_t ir0 = i%4+0; // working<br>
<br>
const std::uint64_t ir0 = (i+0)%4; // not working<br>
<br>
'+0' stands for +0, +1, +2, +3 in the four unrolled iterations.<br>
<br>
'Working' means the SLP vectorizer succeeded.<br>
<br>
Thus, as the index function gets closer to the correct one,
auto-vectorization fails. However, using a simpler index function
is not an option.<br>
<br>
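As a sketch only: if start is a multiple of 4 (an assumption that the original code does not state), then (i+k)/4 == i/4 and (i+k)%4 == k for k = 0..3, so the full index expression collapses to ir0 = 2*i + k and ir1 = 2*i + 4 + k, i.e. eight consecutive elements starting at 2*i. The name bar_simplified is hypothetical, and whether a given LLVM version vectorizes this form would have to be checked in the -debug output.<br>
<br>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of bar(). Assumes `start` is a multiple of 4 and the
// loop steps by 4, so that (i+k)/4 == i/4 and (i+k)%4 == k for k = 0..3.
// Under that assumption ir0 == 2*i + k and ir1 == 2*i + 4 + k, which together
// cover the eight consecutive elements [2*i, 2*i + 8).
void bar_simplified(std::uint64_t start, std::uint64_t end,
                    float* __restrict__ c,
                    const float* __restrict__ a,
                    const float* __restrict__ b)
{
    assert(start % 4 == 0);  // the simplification is only valid in this case
    for (std::uint64_t i = start; i < end; i += 4) {
        const std::uint64_t base = 2 * i;  // first index touched in this iteration
        for (std::uint64_t k = 0; k < 8; ++k)
            c[base + k] = a[base + k] + b[base + k];
    }
}
```

<br>
Calling bar_simplified(0, 8, c, a, b) touches elements 0..15, the same set as the original index expression for i = 0 and i = 4.<br>
<br>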
Is it possible to make the SCEV pass smarter? Or would you
strongly advise against such an endeavor?<br>
<br>
Frank<br>
<br>
<br>
On 30/10/13 21:16, Nadav Rotem wrote:<br>
</div>
<blockquote
cite="mid:4BFE59B4-FF79-4006-8E63-866CE817A615@apple.com"
type="cite"><br>
<div>
<div>On Oct 30, 2013, at 6:10 PM, Frank Winter &lt;<a
moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>&gt;
wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite">The only option I see is to unroll the
loop by hand. Since the array access is consecutive over 4
loop iterations, I gave it a try and unrolled the loop by a
factor of 4. This gives the following array accesses:<br>
<br>
loop iter 0:<br>
index_0 = 0 index_1 = 4<br>
index_0 = 1 index_1 = 5<br>
index_0 = 2 index_1 = 6<br>
index_0 = 3 index_1 = 7<br>
<br>
loop iter 1:<br>
index_0 = 8 index_1 = 12<br>
index_0 = 9 index_1 = 13<br>
index_0 = 10 index_1 = 14<br>
index_0 = 11 index_1 = 15<br>
</blockquote>
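The index table above follows directly from the index expression in the quoted code. A minimal sketch that recomputes it, where the helper name index_pair is hypothetical and inner is fixed at 4 as in bar():<br>
<br>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns (ir0, ir1) for element i + k, using the index expression from bar().
// `i` is the loop counter and `k` the unrolled sub-iteration (0..3).
inline std::pair<std::uint64_t, std::uint64_t>
index_pair(std::uint64_t i, std::uint64_t k)
{
    const std::uint64_t inner = 4;
    assert(k < inner);
    const std::uint64_t e = i + k;
    const std::uint64_t ir0 = ((e / inner) * 2 + 0) * inner + e % inner;
    const std::uint64_t ir1 = ((e / inner) * 2 + 1) * inner + e % inner;
    return {ir0, ir1};
}
```

<br>
For i = 0 and i = 4 with k = 0..3 this yields the pairs listed for loop iterations 0 and 1.<br>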
<div><br>
</div>
<div>The SLP-vectorizer detects 8 stores, but it can’t prove
that they are consecutive, so it moves on. Can you simplify
the address expression? Can you write "index0 = i*8 + 0"
and give it a try?</div>
<br>
<blockquote type="cite"><br>
For completeness, here is the code:<br>
<br>
void bar(std::uint64_t start, std::uint64_t end, float *
__restrict__ c, float * __restrict__ a, float * __restrict__
b)<br>
{<br>
const std::uint64_t inner = 4;<br>
for (std::uint64_t i = start ; i < end ; i+=4 ) {<br>
{<br>
const std::uint64_t ir0 = ( ((i+0)/inner) * 2 + 0 ) *
inner + (i+0)%4;<br>
const std::uint64_t ir1 = ( ((i+0)/inner) * 2 + 1 ) *
inner + (i+0)%4;<br>
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];<br>
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];<br>
}<br>
{<br>
const std::uint64_t ir0 = ( ((i+1)/inner) * 2 + 0 ) *
inner + (i+1)%4;<br>
const std::uint64_t ir1 = ( ((i+1)/inner) * 2 + 1 ) *
inner + (i+1)%4;<br>
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];<br>
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];<br>
}<br>
{<br>
const std::uint64_t ir0 = ( ((i+2)/inner) * 2 + 0 ) *
inner + (i+2)%4;<br>
const std::uint64_t ir1 = ( ((i+2)/inner) * 2 + 1 ) *
inner + (i+2)%4;<br>
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];<br>
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];<br>
}<br>
{<br>
const std::uint64_t ir0 = ( ((i+3)/inner) * 2 + 0 ) *
inner + (i+3)%4;<br>
const std::uint64_t ir1 = ( ((i+3)/inner) * 2 + 1 ) *
inner + (i+3)%4;<br>
c[ ir0 ] = a[ ir0 ] + b[ ir0 ];<br>
c[ ir1 ] = a[ ir1 ] + b[ ir1 ];<br>
}<br>
}<br>
}<br>
<br>
<br>
This should be an ideal test case for the SLP vectorizer,
right?<br>
<br>
It seems I am out of luck:<br>
<br>
opt -O3 -vectorize-slp -debug loop.ll -S<br>
<br>
SLP: Analyzing blocks in _Z3barmmPfS_S_.<br>
SLP: Found 8 stores to vectorize.<br>
SLP: Analyzing a store chain of length 8.<br>
SLP: Trying to vectorize starting at PHIs (1)<br>
SLP: Vectorizing a list of length = 2.<br>
SLP: Vectorizing a list of length = 2.<br>
SLP: Vectorizing a list of length = 2.<br>
</blockquote>
</div>
<br>
</blockquote>
<br>
<br>
</body>
</html>