[LLVMdev] RFC: Loop distribution/Partial vectorization
Philip Reames
listmail at philipreames.com
Tue Jan 13 16:24:51 PST 2015
On 01/12/2015 10:42 AM, Adam Nemet wrote:
> Hi,
>
> We'd like to propose new Loop Distribution pass. The main motivation
> is to
> allow partial vectorization of loops. One such example is the main
> loop of
> 456.hmmer in SpecINT_2006. The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles. The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization. E.g.:
>
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = …
> // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
> S2: DC[k+1] = DC[k], MC[k]
> }
> => (Loop Distribute)
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
> => (Loop Vectorize S1)
> for (k = 0; k < M; k += 4) {
> S1: MC[k+1:k+5] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
>
> I'd like to collect feedback on the design decisions made so far.
> These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before
> the Loop
> Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
> Vectorizer. This along with the AccessAnalysis class is split out
> into a new
> LoopAccessAnalysis class. We may want to turn this into an analysis
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer. The
> hmmer loop requires memchecks. This is again captured by the same
> LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
> - The list of unsafe memory dependencies is computed for the loop.
> Unsafe
> means that the dependence may be part of a cycle (this is what the
> current
> framework provides).
> - Partitions are created for each set of unsafe dependences.
> - Partitions are created for each of the remaining stores not yet
> encountered.
> The order of the partitions preserve the original order of the
> dependent
> memory accesses.
> - Simple partition merging is performed to minimize the number of
> new loops.
> - Partitions are populated with the other dependent instructions by
> following
> the SSA use-def chains and control dependence edges.
> - Finally, the actual distribution is performed by creating a loop
> for each
> partition. For each partition we clone the loop and remove all the
> instructions that don't belong to the partition.
> - Also, if run-time memory checks are necessary, these are emitted.
> We keep
> an original version of the loop around to branch too if the checks
> fail.
I like the general direction. One potential concern I have is regards
to distributing a loop which we turn out not to vectorize and
potentially creating larger code for no clear benefit. We'll have to
see how this works in practice.
>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller
> patches from
> the current patch and completing them. The final commit will enable
> loop
> distribution with a command-line flag or a loop hint.
I look forward to seeing your patches. Getting this in incrementally
will take some work on all sides, but is definitely better than trying
to land one large patch.
>
> - Explore and fine-tune the proper cost model for loop distribution to
> allow
> partial vectorization. This is essentially whether to partition and
> what
> these partitions should be. Currently instructions are mapped to
> partitions
> using a simple heuristics to create a vectorizable partitions. We
> may need to
> interact with the vectorizer to make sure the vectorization will
> actually
> happen and it will be overall profitable.
As I said above, this is my biggest area of concern. It'll be
interesting to see where you end up.
>
> - Explore other potentials for loop distribution, e.g.:
> - Partial vectorization of loops that can't be if-converted
> - Classic loop distribution to improve spatial locality
> - Compute the Program Dependence Graph rather than the list of
> unsafe memory
> accesses and allow reordering of memory operations
> - Distribute a loop in order to recognize parts as loop idioms
>
> Long term, loop distribution could also become a transformation
> utility
> (Transform/Util). That way, the loop transformation passes could
> use it to
> strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the
> next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150113/98ee0939/attachment.html>
More information about the llvm-dev
mailing list