<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-cite-prefix">On 01/12/2015 10:42 AM, Adam Nemet
wrote:<br>
</div>
<blockquote
cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<div class="">Hi,</div>
<div class=""><br class="">
</div>
<div class="">We'd like to propose new Loop Distribution pass.
The main motivation is to</div>
<div class="">allow partial vectorization of loops. One such
example is the main loop of</div>
<div class="">456.hmmer in SpecINT_2006. The current version of
the patch improves hmmer by</div>
<div class="">24% on ARM64 and 18% on X86.</div>
<div class=""><br class="">
</div>
<div class="">The goal of the pass is to distribute a loop that
can't be vectorized because of</div>
<div class="">memory dependence cycles. The pass splits the part
with cycles into a new loop</div>
<div class="">making the remainder of the loop a candidate for
vectorization. E.g.:</div>
<div class=""><br class="">
</div>
<div class=""> for (k = 0; k < M; k++) {</div>
<div class=""> S1: MC[k+1] = …</div>
<div class="">
<div class=""> // Cycle in S2 due to DC[k+1] -> DC[k]
loop-carried dependence</div>
</div>
<div class=""> S2: DC[k+1] = DC[k], MC[k]
</div>
<div class=""> }</div>
<div class=""> </div>
<div class=""> => (Loop Distribute)</div>
<div class=""> </div>
<div class=""> for (k = 0; k < M; k++) {</div>
<div class=""> S1: MC[k+1] = ...</div>
<div class=""> }</div>
<div class=""> for (k = 0; k < M; k++) {</div>
<div class=""> S2: DC[k+1] = DC[k], MC[k]</div>
<div class=""> }</div>
<div class=""> </div>
<div class=""> => (Loop Vectorize S1)</div>
<div class=""> </div>
<div class=""> for (k = 0; k < M; k += 4) {</div>
<div class=""> S1: MC[k+1:k+5] = ...</div>
<div class=""> }</div>
<div class=""> for (k = 0; k < M; k++) {</div>
<div class=""> S2: DC[k+1] = DC[k], MC[k]</div>
<div class=""> }</div>
<div class=""><br class="">
</div>
<div class="">I'd like to collect feedback on the design decisions
made so far. These are</div>
<div class="">implemented in a proof-of-concept patch (<a
moz-do-not-send="true" href="http://reviews.llvm.org/D6930"
class="">http://reviews.llvm.org/D6930</a>).</div>
<div class="">Here is the list of design choices:</div>
<div class=""><br class="">
</div>
<div class="">- Loop Distribution is implemented as a separate
pass to be run before the Loop</div>
<div class=""> Vectorizer.</div>
<div class=""><br class="">
</div>
<div class="">- The pass reuses the Memory Dependence Checker
framework from the Loop</div>
<div class=""> Vectorizer. This along with the AccessAnalysis
class is split out into a new</div>
<div class=""> LoopAccessAnalysis class. We may want to turn
this into an analysis pass on its own.</div>
<div class=""><br class="">
</div>
<div class="">- It also reuses the Run-time Memory Check code from
the Loop Vectorizer. The</div>
<div class=""> hmmer loop requires memchecks. This is again
captured by the same</div>
<div class=""> LoopAccessAnalysis class.</div>
<div class=""><br class="">
</div>
<div class="">- The actual loop distribution is implemented as
follows:</div>
<div class=""><br class="">
</div>
<div class=""> - The list of unsafe memory dependencies is
computed for the loop. Unsafe</div>
<div class=""> means that the dependence may be part of a cycle
(this is what the current</div>
<div class=""> framework provides).</div>
<div class=""> - Partitions are created for each set of unsafe
dependences.</div>
<div class=""> - Partitions are created for each of the remaining
stores not yet encountered.</div>
<div class=""> The order of the partitions preserve the
original order of the dependent</div>
<div class=""> memory accesses.</div>
<div class=""> - Simple partition merging is performed to
minimize the number of new loops.</div>
<div class=""> - Partitions are populated with the other
dependent instructions by following</div>
<div class=""> the SSA use-def chains and control dependence
edges.</div>
<div class=""> - Finally, the actual distribution is performed by
creating a loop for each</div>
<div class=""> partition. For each partition we clone the loop
and remove all the</div>
<div class=""> instructions that don't belong to the partition.</div>
<div class=""> - Also, if run-time memory checks are necessary,
these are emitted. We keep</div>
<div class=""> an original version of the loop around to branch
too if the checks fail.</div>
</blockquote>
I like the general direction. One potential concern I have is
regards to distributing a loop which we turn out not to vectorize
and potentially creating larger code for no clear benefit. We'll
have to see how this works in practice. <br>
<blockquote
cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"
type="cite">
<div class=""><br class="">
</div>
<div class="">My plan is to proceed with the following steps:</div>
<div class=""><br class="">
</div>
<div class="">- Bring the current functionality to trunk by
splitting off smaller patches from</div>
<div class=""> the current patch and completing them. The final
commit will enable loop</div>
<div class=""> distribution with a command-line flag or a loop
hint.</div>
</blockquote>
I look forward to seeing your patches. Getting this in
incrementally will take some work on all sides, but is definitely
better than trying to land one large patch.<br>
<blockquote
cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"
type="cite">
<div class=""><br class="">
</div>
<div class="">- Explore and fine-tune the proper cost model for
loop distribution to allow</div>
<div class=""> partial vectorization. This is essentially
whether to partition and what</div>
<div class=""> these partitions should be. Currently
instructions are mapped to partitions</div>
<div class=""> using a simple heuristics to create a vectorizable
partitions. We may need to</div>
<div class=""> interact with the vectorizer to make sure the
vectorization will actually</div>
<div class=""> happen and it will be overall profitable.</div>
</blockquote>
As I said above, this is my biggest area of concern. It'll be
interesting to see where you end up.<br>
<blockquote
cite="mid:74967B9E-D5CC-4E8B-BC77-70B9F60DD48E@apple.com"
type="cite">
<div class=""><br class="">
</div>
<div class="">- Explore other potentials for loop distribution,
e.g.:</div>
<div class=""> - Partial vectorization of loops that can't be
if-converted</div>
<div class=""> - Classic loop distribution to improve spatial
locality</div>
<div class=""> - Compute the Program Dependence Graph rather than
the list of unsafe memory</div>
<div class=""> accesses and allow reordering of memory
operations</div>
<div class=""> - Distribute a loop in order to recognize parts as
loop idioms</div>
<div class=""><br class="">
</div>
<div class=""> Long term, loop distribution could also become a
transformation utility</div>
<div class=""> (Transform/Util). That way, the loop
transformation passes could use it to</div>
<div class=""> strip the loop from parts that inhibits the
given optimization.</div>
<div class=""><br class="">
</div>
<div class="">Please let me know if you have feedback either on
the design or on the next</div>
<div class="">steps.</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class="">Adam</div>
<div class=""><br class="">
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a class="moz-txt-link-freetext" href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a>
<a class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a>
</pre>
</blockquote>
<br>
</body>
</html>