<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix"><tt>Here is a small but complete

        example function that none of the vectorization passes manages

        to optimize:</tt><tt><br>

      </tt><tt><br>

      </tt><tt>#include &lt;cstdint&gt;</tt><tt><br>

      </tt><tt>#include &lt;iostream&gt;</tt><tt><br>

      </tt><tt><br>

      </tt><tt>void bar(std::uint64_t start, std::uint64_t end, float *

        __restrict__  c, float * __restrict__ a, float * __restrict__ b)</tt><tt><br>

      </tt><tt>{</tt><tt><br>

      </tt><tt>  for ( std::uint64_t i = start ; i < end ; i += 4 ) {</tt><tt><br>

      </tt><tt>    {</tt><tt><br>

      </tt><tt>      const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);</tt><tt><br>

      </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];</tt><tt><br>

      </tt><tt>    }</tt><tt><br>

      </tt><tt>    {</tt><tt><br>

      </tt><tt>      const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);</tt><tt><br>

      </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];</tt><tt><br>

      </tt><tt>    }</tt><tt><br>

      </tt><tt>    {</tt><tt><br>

      </tt><tt>      const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);</tt><tt><br>

      </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];</tt><tt><br>

      </tt><tt>    }</tt><tt><br>

      </tt><tt>    {</tt><tt><br>

      </tt><tt>      const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);</tt><tt><br>

      </tt><tt>      c[ ir0 ]         = a[ ir0 ]         + b[ ir0 ];</tt><tt><br>

      </tt><tt>    }</tt><tt><br>

      </tt><tt>  }</tt><tt><br>

      </tt><tt>} </tt><tt><br>

      </tt><tt><br>

      </tt><tt>The loop index i and the array indices accessed in each of

        the first 4 iterations:</tt><tt><br>

      </tt><tt><br>

      </tt><tt>i 0:     0 1 2 3 </tt><tt><br>

      </tt><tt>i 4:     8 9 10 11 </tt><tt><br>

      </tt><tt>i 8:     16 17 18 19 </tt><tt><br>

      </tt><tt>i 12:     24 25 26 27<br>

      </tt><br>
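      <tt>The index map can be checked in isolation. The following is a
        minimal sketch, not part of the original test case: map_index is
        a made-up helper name, and unsigned long long stands in for
        std::uint64_t so that the snippet needs no headers.</tt><br>
      <pre>
```cpp
// Hypothetical helper, not part of the original example: the array
// index used by lane k (0..3) of the loop body at loop index i.
// unsigned long long stands in for std::uint64_t to avoid a header.
unsigned long long map_index(unsigned long long i, unsigned long long k)
{
  return (i + k) % 4 + 8 * ((i + k) / 4);
}
```
</pre>
      <tt>For i = 0, 4, 8, 12 this reproduces the table above. For an i
        that is not a multiple of 4 the four indices are no longer
        contiguous (i = 1 gives 1 2 3 8), and the signature of bar() does
        not rule that case out.</tt><br>
      <br>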

      <pre class="bz_comment_text" id="comment_text_1">For example, on an x86 processor with SSE (128-bit SIMD vectors), the loop body could be vectorized into 2 SIMD loads, 1 SIMD add and 1 SIMD store.

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11

opt -O3 -vectorize-slp -S loop_minimal.ll

opt -O3 -loop-vectorize -S loop_minimal.ll

opt -O3 -bb-vectorize -S loop_minimal.ll

All three vectorization passes miss this opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.

How can the SCEV AA pass be extended to handle this type of arithmetic?

</pre>
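      <tt>As a possible workaround, the modulo and division can be
        removed by hand. This is only a sketch under the assumption that
        start is a multiple of 4, which the original signature does not
        guarantee. In that case (i+k)%4 is k and 8*((i+k)/4) is 2*i for
        k = 0..3, so each iteration touches four contiguous elements.
        bar_aligned is a made-up name, unsigned long long stands in for
        std::uint64_t, and whether a given vectorizer picks this form up
        has not been checked here.</tt><br>
      <pre>
```cpp
// Sketch only: equivalent to bar() when start is a multiple of 4,
// which the original function does not require.
// bar_aligned is a hypothetical name used for illustration.
void bar_aligned(unsigned long long start, unsigned long long end,
                 float * __restrict__ c, float * __restrict__ a,
                 float * __restrict__ b)
{
  for ( unsigned long long i = start ; i < end ; i += 4 ) {
    // With i % 4 == 0: (i+k)%4 == k and 8*((i+k)/4) == 2*i for k = 0..3.
    const unsigned long long base = 2 * i;
    for ( unsigned long long k = 0 ; k < 4 ; ++k ) {
      c[ base + k ] = a[ base + k ] + b[ base + k ];
    }
  }
}
```
</pre>
      <tt>The inner loop has a constant trip count of 4 and unit stride,
        which is the shape that maps onto 2 SIMD loads, 1 SIMD add and
        1 SIMD store with 128-bit vectors.</tt><br>
      <br>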

      <tt>Frank<br>

      </tt><br>

      <pre class="bz_comment_text" id="comment_text_1">

On 31/10/13 02:21, Renato Golin wrote:

</pre>

    </div>

    <blockquote

cite="mid:CAMSE1ke-_kjwTfWop3h86C133o=6QAHivCzAUYkwd6kuzqq5Sw@mail.gmail.com"

      type="cite">

      <div dir="ltr">On 30 October 2013 18:40, Frank Winter &lt;<a

          moz-do-not-send="true" href="mailto:fwinter@jlab.org">fwinter@jlab.org</a>&gt;

        wrote:<br>

        <div class="gmail_extra">

          <div class="gmail_quote">

            <blockquote class="gmail_quote">

              <div bgcolor="#FFFFFF" text="#000000">

                <div>      const std::uint64_t ir0 = (i+0)%4;  // not

                  working<br>

                </div>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>I thought this would be the case when I saw the

              original expression. Maybe we need to teach modulo

              arithmetic to SCEV?</div>

          </div>

          <br>

        </div>

        <div class="gmail_extra">--renato</div>

      </div>

    </blockquote>

    <br>

    <br>

  </body>

</html>