[LLVMdev] RFC: Loop distribution/Partial vectorization

Tue Jan 13 16:24:51 PST 2015

On 01/12/2015 10:42 AM, Adam Nemet wrote:
> Hi,
>
> We'd like to propose new Loop Distribution pass.  The main motivation 
> is to
> allow partial vectorization of loops.  One such example is the main 
> loop of
> 456.hmmer in SpecINT_2006.  The current version of the patch improves 
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized 
> because of
> memory dependence cycles.  The pass splits the part with cycles into a 
> new loop
> making the remainder of the loop a candidate for vectorization.  E.g.:
>
>         for (k = 0; k < M; k++) {
>          S1:   MC[k+1] = …
>          // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>       => (Loop Distribute)
>         for (k = 0; k < M; k++) {
>          S1:   MC[k+1] = ...
>         }
>         for (k = 0; k < M; k++) {
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>       => (Loop Vectorize S1)
>         for (k = 0; k < M; k += 4) {
>          S1:   MC[k+1:k+5] = ...
>         }
>         for (k = 0; k < M; k++) {
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>
> I'd like to collect feedback on the design decisions made so far. 
>  These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before 
> the Loop
>   Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
>   Vectorizer.  This along with the AccessAnalysis class is split out 
> into a new
>   LoopAccessAnalysis class.  We may want to turn this into an analysis 
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop 
> Vectorizer.  The
>   hmmer loop requires memchecks.  This is again captured by the same
>   LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
>   - The list of unsafe memory dependencies is computed for the loop. 
>  Unsafe
>     means that the dependence may be part of a cycle (this is what the 
> current
>     framework provides).
>   - Partitions are created for each set of unsafe dependences.
>   - Partitions are created for each of the remaining stores not yet 
> encountered.
>     The order of the partitions preserve the original order of the 
> dependent
>     memory accesses.
>   - Simple partition merging is performed to minimize the number of 
> new loops.
>   - Partitions are populated with the other dependent instructions by 
> following
>     the SSA use-def chains and control dependence edges.
>   - Finally, the actual distribution is performed by creating a loop 
> for each
>     partition.  For each partition we clone the loop and remove all the
>     instructions that don't belong to the partition.
>   - Also, if run-time memory checks are necessary, these are emitted. 
>  We keep
>     an original version of the loop around to branch too if the checks 
> fail.
I like the general direction.  One potential concern I have is regards 
to distributing a loop which we turn out not to vectorize and 
potentially creating larger code for no clear benefit.  We'll have to 
see how this works in practice.
>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller 
> patches from
>   the current patch and completing them.  The final commit will enable 
> loop
>   distribution with a command-line flag or a loop hint.
I look forward to seeing your patches.  Getting this in incrementally 
will take some work on all sides, but is definitely better than trying 
to land one large patch.
>
> - Explore and fine-tune the proper cost model for loop distribution to 
> allow
>   partial vectorization.  This is essentially whether to partition and 
> what
>   these partitions should be.  Currently instructions are mapped to 
> partitions
>   using a simple heuristics to create a vectorizable partitions.  We 
> may need to
>   interact with the vectorizer to make sure the vectorization will 
> actually
>   happen and it will be overall profitable.
As I said above, this is my biggest area of concern.  It'll be 
interesting to see where you end up.
>
> - Explore other potentials for loop distribution, e.g.:
>   - Partial vectorization of loops that can't be if-converted
>   - Classic loop distribution to improve spatial locality
>   - Compute the Program Dependence Graph rather than the list of 
> unsafe memory
>     accesses and allow reordering of memory operations
>   - Distribute a loop in order to recognize parts as loop idioms
>
>     Long term, loop distribution could also become a transformation 
> utility
>     (Transform/Util).  That way, the loop transformation passes could 
> use it to
>     strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the 
> next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150113/98ee0939/attachment.html>