<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Hi Nadav,<br>

      <br>

      that's the whole point of it. I can't in general make the index

      calculation simpler. The example given is the simplest non-trivial

      index function that is needed. It might well be that it's that

      simple that the index calculation in this case can be thrown aways

      altogether and - as you say - be replaced by the simple loop you

      mentioned. However, this cannot be done in the general case. In

      the general case the index calculation requires the 'rem' and

      'div' instruction. The OR instruction must be the result of an

      arithmetic transformation of one of the previous passes (due to

      -O3?).<br>

      <br>

      I don't see a way around to make the vectorizers recognize such

      arithmetic operations.<br>

      <br>

      Do you think this can be done relatively quickly or does this

      involve a huge effort?<br>

      <br>

      What does 'stepping through' the loop vectorizer mean? Using the

      debugger and step through the program? Probably not. Is the way to

      go, to alter the 'canVectorize' function print debug output to see

      what's going on?<br>

      <br>

      Hal, you seem to know the loop vectorizer. Is this the place to

      look at or is the SLP vectorizer the more promising place?<br>

      <br>

      Frank<br>

      <br>

      <br>

      On 31/10/13 12:50, Nadav Rotem wrote:<br>

    </div>

    <blockquote

      cite="mid:916FD825-98E5-4E99-9D1F-5FB46F0CEEF0@apple.com"

      type="cite">

      <div>Hi Frank, </div>

      <div><br>

      </div>

      <div>This loop should be vectorized by the SLP-vectorizer.  It has

        several scalars (C[0], C[1] … ) that can be merged into a

        vector.  The SLP vectorizer can’t figure out that the stores are

        consecutive because SCEV can’t analyze the OR in the index

        calculation:</div>

      <div> </div>

      <div>  %2 = and i64 %i.04, 3<br>

          %3 = lshr i64 %i.04, 2<br>

          %4 = shl i64 %3, 3<br>

          %5 = or i64 %4, %2<br>

          %11 = getelementptr inbounds float* %c, i64 %5<br>

          store float %10, float* %11, align 4, !tbaa !0<br>

        <br>

      </div>

      <div>You wrote that you want each iteration to look like this:</div>

      <div><br>

      </div>

      <div>

        <blockquote type="cite">

          <div bgcolor="#FFFFFF" text="#000000">

            <div class="moz-cite-prefix"><tt><br>

              </tt><tt>i 0:     0 1 2 3 </tt><tt><br>

              </tt><tt>i 4:     8 9 10 11 </tt><tt><br>

              </tt><tt>i 8:     16 17 18 19 </tt><tt><br>

              </tt><tt>i 12:     24 25 26 27<br>

              </tt></div>

          </div>

        </blockquote>

      </div>

      <div><br>

      </div>

      <div>Why can’t you just write a small loop like this:   for (i=0;

        i<4; i++) C[i] = A[i] + B[i]  ??    Either the unroller will

        unroll it and the SLP-vectorizer will vectorize the unrolled

        iterations, or the loop-vectorizer would catch it. </div>

      <div><br>

      </div>

      <div>Thanks,</div>

      <div>Nadav</div>

      <br>

      <div>

        <div>On Oct 31, 2013, at 8:01 AM, Frank Winter <<a

            moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>>

          wrote:</div>

        <br class="Apple-interchange-newline">

        <blockquote type="cite">

          <div bgcolor="#FFFFFF" text="#000000">

            <div class="moz-cite-prefix"><tt>A quite small but yet

                complete example function which all vectorization passes

                fail to optimize:</tt><tt><br>

              </tt><tt><br>

              </tt><tt>#include <cstdint></tt><tt><br>

              </tt><tt>#include <iostream></tt><tt><br>

              </tt><tt><br>

              </tt><tt>void bar(std::uint64_t start, std::uint64_t end,

                float * __restrict__  c, float * __restrict__ a, float *

                __restrict__ b)</tt><tt><br>

              </tt><tt>{</tt><tt><br>

              </tt><tt>  for ( std::uint64_t i = start ; i < end ; i

                += 4 ) {</tt><tt><br>

              </tt><tt>    {</tt><tt><br>

              </tt><tt>      const std::uint64_t ir0 = (i+0)%4 +

                8*((i+0)/4);</tt><tt><br>

              </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[

                ir0 ];</tt><tt><br>

              </tt><tt>    }</tt><tt><br>

              </tt><tt>    {</tt><tt><br>

              </tt><tt>      const std::uint64_t ir0 = (i+1)%4 +

                8*((i+1)/4);</tt><tt><br>

              </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[

                ir0 ];</tt><tt><br>

              </tt><tt>    }</tt><tt><br>

              </tt><tt>    {</tt><tt><br>

              </tt><tt>      const std::uint64_t ir0 = (i+2)%4 +

                8*((i+2)/4);</tt><tt><br>

              </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[

                ir0 ];</tt><tt><br>

              </tt><tt>    }</tt><tt><br>

              </tt><tt>    {</tt><tt><br>

              </tt><tt>      const std::uint64_t ir0 = (i+3)%4 +

                8*((i+3)/4);</tt><tt><br>

              </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[

                ir0 ];</tt><tt><br>

              </tt><tt>    }</tt><tt><br>

              </tt><tt>  }</tt><tt><br>

              </tt><tt>} </tt><tt><br>

              </tt><tt><br>

              </tt><tt>The loop index and array accesses for the first 4

                iterations:</tt><tt><br>

              </tt><tt><br>

              </tt><tt>i 0:     0 1 2 3 </tt><tt><br>

              </tt><tt>i 4:     8 9 10 11 </tt><tt><br>

              </tt><tt>i 8:     16 17 18 19 </tt><tt><br>

              </tt><tt>i 12:     24 25 26 27<br>

              </tt><br>

              <pre class="bz_comment_text" id="comment_text_1">For example on an x86 processor with SSE (128 bit SIMD vectors) the loop body could be vectorized into 2 SIMD reads, 1 SIMD add and 1 SIMD store.

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11

opt -O3 -vectorize-slp -S loop_minimal.ll

opt -O3 -loop-vectorize -S loop_minimal.ll

opt -O3 -bb-vectorize -S loop_minimal.ll

All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.

How can the SCEV AA pass be extended to handle this type of arithmetic?

</pre>

              <tt>Frank<br>

              </tt><br>

              <pre class="bz_comment_text" id="comment_text_1">On 31/10/13 02:21, Renato Golin wrote:

</pre>

            </div>

            <blockquote

cite="mid:CAMSE1ke-_kjwTfWop3h86C133o=6QAHivCzAUYkwd6kuzqq5Sw@mail.gmail.com"

              type="cite">

              <div dir="ltr">On 30 October 2013 18:40, Frank Winter <<a

                  moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>>

                wrote:<br>

                <div class="gmail_extra">

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote">

                      <div bgcolor="#FFFFFF" text="#000000">

                        <div>      const std::uint64_t ir0 = (i+0)%4; 

                          // not working<br>

                        </div>

                      </div>

                    </blockquote>

                    <div><br>

                    </div>

                    <div>I thought this would be the case when I saw the

                      original expression. Maybe we need to teach module

                      arithmetic to SCEV?</div>

                  </div>

                  <br>

                </div>

                <div class="gmail_extra">--renato</div>

              </div>

            </blockquote>

            <br>

            <br>

          </div>

        </blockquote>

      </div>

      <br>

    </blockquote>

    <br>

    <br>

  </body>

</html>