<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">I tried the following on the

      hand-unrolled loop:<br>

      <br>

            const std::uint64_t ir0 = i*8+0; // working<br>

      <br>

            const std::uint64_t ir0 = i%4+0; // working<br>

      <br>

            const std::uint64_t ir0 = (i+0)%4;  // not working<br>

      <br>

      Here '+0' stands for +0, +1, +2, +3 in the four unrolled iterations.<br>

      <br>

      'Working' means the SLP vectorizer succeeded.<br>

      <br>

      Thus, auto-vectorization fails as soon as the expression moves

      'towards' the correct index function, and using a simpler index

      function is not an option for me.<br>

      <br>
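      For illustration, a minimal sketch of how the index could be written in affine form, under the assumption (not stated in the code quoted below) that start is a multiple of 4. In that case (i+k)/4 == i/4 and (i+k)%4 == k for k = 0..3, so the two index expressions reduce to 2*i + k and 2*i + 4 + k, and each loop iteration touches 8 consecutive elements. Without that assumption the two forms differ (for i = 1, k = 3 the original expression yields 8, not 5), which may be part of why consecutiveness cannot be proven. The names u64, original_index, affine_index and bar_affine are hypothetical:<br>
<pre>
```cpp
// Sketch only: u64, original_index, affine_index and bar_affine are
// hypothetical names. Assumes start (and hence every i) is a multiple of 4.
typedef unsigned long long u64;

// Index expression from the hand-unrolled loop: k = 0..3 is the unrolled
// slot, h = 0 selects ir0 and h = 1 selects ir1.
constexpr u64 original_index(u64 i, u64 k, u64 h) {
  return (((i + k) / 4) * 2 + h) * 4 + (i + k) % 4;
}

// Equivalent affine form, valid only when i % 4 == 0 and k is in 0..3.
constexpr u64 affine_index(u64 i, u64 k, u64 h) {
  return 2 * i + 4 * h + k;
}

// Same computation as bar() under the assumption above: each iteration
// covers the 8 consecutive elements starting at 2*i.
void bar_affine(u64 start, u64 end, float* __restrict__ c,
                float* __restrict__ a, float* __restrict__ b) {
  for (u64 i = start; i < end; i += 4) {
    const u64 base = 2 * i;
    for (u64 k = 0; k != 8; ++k) {
      c[base + k] = a[base + k] + b[base + k];
    }
  }
}
```
</pre>
      Whether the vectorizers actually pick up this form would still have to be checked with opt -debug.<br>
      <br>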

      Is it possible to make the SCEV pass smarter? Or would you

      strongly advise against such an endeavor?<br>

      <br>

      Frank<br>

      <br>

      <br>

      On 30/10/13 21:16, Nadav Rotem wrote:<br>

    </div>

    <blockquote

      cite="mid:4BFE59B4-FF79-4006-8E63-866CE817A615@apple.com"

      type="cite"><br>

      <div>

        <div>On Oct 30, 2013, at 6:10 PM, Frank Winter &lt;<a

            moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>&gt;

          wrote:</div>

        <br class="Apple-interchange-newline">

        <blockquote type="cite">The only option I see is to unroll the

          loop by hand. Since the array accesses are consecutive over 4

          loop iterations, I gave it a try and unrolled the loop by a

          factor of 4, which gives the following array accesses:<br>

          <br>

          loop iter 0:<br>

          index_0 = 0   index_1 = 4<br>

          index_0 = 1   index_1 = 5<br>

          index_0 = 2   index_1 = 6<br>

          index_0 = 3   index_1 = 7<br>

          <br>

          loop iter 1:<br>

          index_0 = 8   index_1 = 12<br>

          index_0 = 9   index_1 = 13<br>

          index_0 = 10   index_1 = 14<br>

          index_0 = 11   index_1 = 15<br>

        </blockquote>

        <div><br>

        </div>

        <div>The SLP-vectorizer detects 8 stores, but it can’t prove

          that they are consecutive, so it moves on.  Can you simplify

          the address expression?  Can you write "index0 = i*8 + 0"

          and give it a try?</div>

        <br>

        <blockquote type="cite"><br>

          For completeness, here is the code:<br>

          <br>

          void bar(std::uint64_t start, std::uint64_t end, float *

          __restrict__  c, float * __restrict__ a, float * __restrict__

          b)<br>

          {<br>

           const std::uint64_t inner = 4;<br>

           for (std::uint64_t i = start ; i < end ; i+=4 ) {<br>

             {<br>

               const std::uint64_t ir0 = ( ((i+0)/inner) * 2 + 0 ) *

          inner + (i+0)%4;<br>

               const std::uint64_t ir1 = ( ((i+0)/inner) * 2 + 1 ) *

          inner + (i+0)%4;<br>

               c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];<br>

               c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];<br>

             }<br>

             {<br>

               const std::uint64_t ir0 = ( ((i+1)/inner) * 2 + 0 ) *

          inner + (i+1)%4;<br>

               const std::uint64_t ir1 = ( ((i+1)/inner) * 2 + 1 ) *

          inner + (i+1)%4;<br>

               c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];<br>

               c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];<br>

             }<br>

             {<br>

               const std::uint64_t ir0 = ( ((i+2)/inner) * 2 + 0 ) *

          inner + (i+2)%4;<br>

               const std::uint64_t ir1 = ( ((i+2)/inner) * 2 + 1 ) *

          inner + (i+2)%4;<br>

               c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];<br>

               c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];<br>

             }<br>

             {<br>

               const std::uint64_t ir0 = ( ((i+3)/inner) * 2 + 0 ) *

          inner + (i+3)%4;<br>

               const std::uint64_t ir1 = ( ((i+3)/inner) * 2 + 1 ) *

          inner + (i+3)%4;<br>

               c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];<br>

               c[ ir1 ]         = a[ ir1 ]         + b[ ir1 ];<br>

             }<br>

           }<br>

          }<br>

          <br>

          <br>

          This should be an ideal test case for the SLP vectorizer,

          right?<br>

          <br>

          It seems I am out of luck:<br>

          <br>

          opt -O3 -vectorize-slp -debug loop.ll -S<br>

          <br>

          SLP: Analyzing blocks in _Z3barmmPfS_S_.<br>

          SLP: Found 8 stores to vectorize.<br>

          SLP: Analyzing a store chain of length 8.<br>

          SLP: Trying to vectorize starting at PHIs (1)<br>

          SLP: Vectorizing a list of length = 2.<br>

          SLP: Vectorizing a list of length = 2.<br>

          SLP: Vectorizing a list of length = 2.<br>

        </blockquote>

      </div>

      <br>

    </blockquote>

    <br>

    <br>

  </body>

</html>