<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 01/12/2015 10:42 AM, Adam Nemet

      wrote:<br>

    </div>

    <blockquote

      cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=windows-1252">

      <div class="">Hi,</div>

      <div class=""><br class="">

      </div>

      <div class="">We'd like to propose new Loop Distribution pass.

         The main motivation is to</div>

      <div class="">allow partial vectorization of loops.  One such

        example is the main loop of</div>

      <div class="">456.hmmer in SpecINT_2006.  The current version of

        the patch improves hmmer by</div>

      <div class="">24% on ARM64 and 18% on X86.</div>

      <div class=""><br class="">

      </div>

      <div class="">The goal of the pass is to distribute a loop that

        can't be vectorized because of</div>

      <div class="">memory dependence cycles.  The pass splits the part

        with cycles into a new loop</div>

      <div class="">making the remainder of the loop a candidate for

        vectorization.  E.g.:</div>

      <div class=""><br class="">

      </div>

      <div class="">        for (k = 0; k < M; k++) {</div>

      <div class="">         S1:   MC[k+1] = …</div>

      <div class="">

        <div class="">         // Cycle in S2 due to DC[k+1] -> DC[k]

          loop-carried dependence</div>

      </div>

      <div class="">         S2:   DC[k+1] = DC[k], MC[k]              

                               </div>

      <div class="">        }</div>

      <div class="">  </div>

      <div class="">      => (Loop Distribute)</div>

      <div class="">  </div>

      <div class="">        for (k = 0; k < M; k++) {</div>

      <div class="">         S1:   MC[k+1] = ...</div>

      <div class="">        }</div>

      <div class="">        for (k = 0; k < M; k++) {</div>

      <div class="">         S2:   DC[k+1] = DC[k], MC[k]</div>

      <div class="">        }</div>

      <div class="">  </div>

      <div class="">      => (Loop Vectorize S1)</div>

      <div class="">  </div>

      <div class="">        for (k = 0; k < M; k += 4) {</div>

      <div class="">         S1:   MC[k+1:k+5] = ...</div>

      <div class="">        }</div>

      <div class="">        for (k = 0; k < M; k++) {</div>

      <div class="">         S2:   DC[k+1] = DC[k], MC[k]</div>

      <div class="">        }</div>

      <div class=""><br class="">

      </div>

      <div class="">I'd like to collect feedback on the design decisions

        made so far.  These are</div>

      <div class="">implemented in a proof-of-concept patch (<a

          moz-do-not-send="true" href="http://reviews.llvm.org/D6930"

          class="">http://reviews.llvm.org/D6930</a>).</div>

      <div class="">Here is the list of design choices:</div>

      <div class=""><br class="">

      </div>

      <div class="">- Loop Distribution is implemented as a separate

        pass to be run before the Loop</div>

      <div class="">  Vectorizer.</div>

      <div class=""><br class="">

      </div>

      <div class="">- The pass reuses the Memory Dependence Checker

        framework from the Loop</div>

      <div class="">  Vectorizer.  This along with the AccessAnalysis

        class is split out into a new</div>

      <div class="">  LoopAccessAnalysis class.  We may want to turn

        this into an analysis pass on its own.</div>

      <div class=""><br class="">

      </div>

      <div class="">- It also reuses the Run-time Memory Check code from

        the Loop Vectorizer.  The</div>

      <div class="">  hmmer loop requires memchecks.  This is again

        captured by the same</div>

      <div class="">  LoopAccessAnalysis class.</div>

      <div class=""><br class="">

      </div>

      <div class="">- The actual loop distribution is implemented as

        follows:</div>

      <div class=""><br class="">

      </div>

      <div class="">  - The list of unsafe memory dependencies is

        computed for the loop.  Unsafe</div>

      <div class="">    means that the dependence may be part of a cycle

        (this is what the current</div>

      <div class="">    framework provides).</div>

      <div class="">  - Partitions are created for each set of unsafe

        dependences.</div>

      <div class="">  - Partitions are created for each of the remaining

        stores not yet encountered.</div>

      <div class="">    The order of the partitions preserve the

        original order of the dependent</div>

      <div class="">    memory accesses.</div>

      <div class="">  - Simple partition merging is performed to

        minimize the number of new loops.</div>

      <div class="">  - Partitions are populated with the other

        dependent instructions by following</div>

      <div class="">    the SSA use-def chains and control dependence

        edges.</div>

      <div class="">  - Finally, the actual distribution is performed by

        creating a loop for each</div>

      <div class="">    partition.  For each partition we clone the loop

        and remove all the</div>

      <div class="">    instructions that don't belong to the partition.</div>

      <div class="">  - Also, if run-time memory checks are necessary,

        these are emitted.  We keep</div>

      <div class="">    an original version of the loop around to branch

        too if the checks fail.</div>

    </blockquote>

    I like the general direction.  One potential concern I have is

    regards to distributing a loop which we turn out not to vectorize

    and potentially creating larger code for no clear benefit.  We'll

    have to see how this works in practice.  <br>

    <blockquote

      cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"

      type="cite">

      <div class=""><br class="">

      </div>

      <div class="">My plan is to proceed with the following steps:</div>

      <div class=""><br class="">

      </div>

      <div class="">- Bring the current functionality to trunk by

        splitting off smaller patches from</div>

      <div class="">  the current patch and completing them.  The final

        commit will enable loop</div>

      <div class="">  distribution with a command-line flag or a loop

        hint.</div>

    </blockquote>

    I look forward to seeing your patches.  Getting this in

    incrementally will take some work on all sides, but is definitely

    better than trying to land one large patch.<br>

    <blockquote

      cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"

      type="cite">

      <div class=""><br class="">

      </div>

      <div class="">- Explore and fine-tune the proper cost model for

        loop distribution to allow</div>

      <div class="">  partial vectorization.  This is essentially

        whether to partition and what</div>

      <div class="">  these partitions should be.  Currently

        instructions are mapped to partitions</div>

      <div class="">  using a simple heuristics to create a vectorizable

        partitions.  We may need to</div>

      <div class="">  interact with the vectorizer to make sure the

        vectorization will actually</div>

      <div class="">  happen and it will be overall profitable.</div>

    </blockquote>

    As I said above, this is my biggest area of concern.  It'll be

    interesting to see where you end up.<br>

    <blockquote

      cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"

      type="cite">

      <div class=""><br class="">

      </div>

      <div class="">- Explore other potentials for loop distribution,

        e.g.:</div>

      <div class="">  - Partial vectorization of loops that can't be

        if-converted</div>

      <div class="">  - Classic loop distribution to improve spatial

        locality</div>

      <div class="">  - Compute the Program Dependence Graph rather than

        the list of unsafe memory</div>

      <div class="">    accesses and allow reordering of memory

        operations</div>

      <div class="">  - Distribute a loop in order to recognize parts as

        loop idioms</div>

      <div class=""><br class="">

      </div>

      <div class="">    Long term, loop distribution could also become a

        transformation utility</div>

      <div class="">    (Transform/Util).  That way, the loop

        transformation passes could use it to</div>

      <div class="">    strip the loop from parts that inhibits the

        given optimization.</div>

      <div class=""><br class="">

      </div>

      <div class="">Please let me know if you have feedback either on

        the design or on the next</div>

      <div class="">steps.</div>

      <div class=""><br class="">

      </div>

      <div class="">Thanks,</div>

      <div class="">Adam</div>

      <div class=""><br class="">

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a class="moz-txt-link-freetext" href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a>

<a class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a>

</pre>

    </blockquote>

    <br>

  </body>

</html>