[LLVMdev] RFC: Loop distribution/Partial vectorization

Krzysztof Parzyszek kparzysz at codeaurora.org
Tue Jan 13 08:27:09 PST 2015


Thanks for doing this. I really like the idea of having loop 
distribution as a separate pass (and having dependence analysis as an 
analysis pass).

A couple of comments that I have are below.

1. Handle situations like this:

     for (k = 0; k < M; k++) {
       for (i = 0; i < N; ++i) {
S1:     MC[i][k+1] = …
       }
S2:   DC[k+1] = DC[k], MC[…][k]
     }

Basically, recognize and handle dependencies between differently nested 
expressions.


2. Make it general so that it can serve any purpose (not only 
vectorization).  Various targets may want to do different things and 
distribute loops for various reasons that don't apply universally.


-Krzysztof



On 1/12/2015 12:42 PM, Adam Nemet wrote:
> Hi,
>
> We'd like to propose new Loop Distribution pass.  The main motivation is to
> allow partial vectorization of loops.  One such example is the main loop of
> 456.hmmer in SpecINT_2006.  The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles.  The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization.  E.g.:
>
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = …
>           // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Distribute)
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Vectorize S1)
>          for (k = 0; k < M; k += 4) {
>           S1:   MC[k+1:k+5] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>
> I'd like to collect feedback on the design decisions made so far.  These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before
> the Loop
>    Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
>    Vectorizer.  This along with the AccessAnalysis class is split out
> into a new
>    LoopAccessAnalysis class.  We may want to turn this into an analysis
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer.  The
>    hmmer loop requires memchecks.  This is again captured by the same
>    LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
>    - The list of unsafe memory dependencies is computed for the loop.
>   Unsafe
>      means that the dependence may be part of a cycle (this is what the
> current
>      framework provides).
>    - Partitions are created for each set of unsafe dependences.
>    - Partitions are created for each of the remaining stores not yet
> encountered.
>      The order of the partitions preserve the original order of the
> dependent
>      memory accesses.
>    - Simple partition merging is performed to minimize the number of new
> loops.
>    - Partitions are populated with the other dependent instructions by
> following
>      the SSA use-def chains and control dependence edges.
>    - Finally, the actual distribution is performed by creating a loop
> for each
>      partition.  For each partition we clone the loop and remove all the
>      instructions that don't belong to the partition.
>    - Also, if run-time memory checks are necessary, these are emitted.
>   We keep
>      an original version of the loop around to branch too if the checks
> fail.
>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller
> patches from
>    the current patch and completing them.  The final commit will enable loop
>    distribution with a command-line flag or a loop hint.
>
> - Explore and fine-tune the proper cost model for loop distribution to allow
>    partial vectorization.  This is essentially whether to partition and what
>    these partitions should be.  Currently instructions are mapped to
> partitions
>    using a simple heuristics to create a vectorizable partitions.  We
> may need to
>    interact with the vectorizer to make sure the vectorization will actually
>    happen and it will be overall profitable.
>
> - Explore other potentials for loop distribution, e.g.:
>    - Partial vectorization of loops that can't be if-converted
>    - Classic loop distribution to improve spatial locality
>    - Compute the Program Dependence Graph rather than the list of unsafe
> memory
>      accesses and allow reordering of memory operations
>    - Distribute a loop in order to recognize parts as loop idioms
>
>      Long term, loop distribution could also become a transformation utility
>      (Transform/Util).  That way, the loop transformation passes could
> use it to
>      strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation




More information about the llvm-dev mailing list