[LLVMdev] RFC: Loop distribution/Partial vectorization

Hal Finkel hfinkel at anl.gov
Tue Jan 27 12:18:34 PST 2015


----- Original Message -----
> From: "Adam Nemet" <anemet at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Sunday, January 18, 2015 3:24:32 AM
> Subject: Re: [LLVMdev] RFC: Loop distribution/Partial vectorization
> 
> On Jan 17, 2015, at 6:29 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
> 
> 
> From: "Adam Nemet" <anemet at apple.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Monday, January 12, 2015 12:42:36 PM
> Subject: [LLVMdev] RFC: Loop distribution/Partial vectorization
> 
> 
> 
> Hi,
> 
> 
> We'd like to propose a new Loop Distribution pass. The main motivation
> is to
> allow partial vectorization of loops. One such example is the main
> loop of
> 456.hmmer in SpecINT_2006. The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
> 
> Thanks for working on this! We definitely need this capability in
> LLVM (and for more than just enabling vectorization).
> 
> 
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles. The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization. E.g.:
> 
> 
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = …
> 
> // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
> S2: DC[k+1] = DC[k], MC[k]
> }
> 
> => (Loop Distribute)
> 
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
> 
> => (Loop Vectorize S1)
> 
> for (k = 0; k < M; k += 4) {
> S1: MC[k+1:k+5] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
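To make the transformation concrete (the statement bodies are elided above, so the ones below are invented stand-ins), here is a sketch showing that the distributed form computes the same values as the fused loop; the S1->S2 dependence on MC[k] still flows forward after distribution:

```python
# Hypothetical concrete bodies for S1/S2 (the originals are elided in the
# message): MC[k+1] is a pure function of k; DC[k+1] depends on DC[k] and
# MC[k], forming the loop-carried recurrence that blocks vectorization.
M = 64

def fused():
    MC = [0] * (M + 1)
    DC = [1] * (M + 1)
    for k in range(M):
        MC[k + 1] = 2 * k + 1        # S1: no loop-carried dependence
        DC[k + 1] = DC[k] + MC[k]    # S2: cycle via DC[k+1] -> DC[k]
    return MC, DC

def distributed():
    MC = [0] * (M + 1)
    DC = [1] * (M + 1)
    for k in range(M):               # now a vectorizable loop
        MC[k + 1] = 2 * k + 1
    for k in range(M):               # sequential loop keeps the recurrence
        DC[k + 1] = DC[k] + MC[k]
    return MC, DC

assert fused() == distributed()
```

Distribution is legal here because S2's only cross-statement dependence reads MC[k], which S1 wrote in an earlier iteration; running all of S1 first preserves that direction.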
> 
> 
> I'd like to collect feedback on the design decisions made so far.
> These are
> implemented in a proof-of-concept patch (
> http://reviews.llvm.org/D6930 ).
> Here is the list of design choices:
> 
> 
> - Loop Distribution is implemented as a separate pass to be run
> before the Loop
> Vectorizer.
> 
> 
> - The pass reuses the Memory Dependence Checker framework from the
> Loop
> Vectorizer. This along with the AccessAnalysis class is split out
> into a new
> LoopAccessAnalysis class. We may want to turn this into an analysis
> pass on its own.
> 
> This is good. It would be nice if this new analysis could have the
> same interface as the DependenceAnalysis pass, so that it will be
> easy to switch between them. I think that, eventually, we'll want to
> switch everything to use something like DependenceAnalysis, at least
> at higher optimization levels.
> 
> 
> 
> Yes, that is precisely what Arnold and I discussed a few weeks ago.
> We want to reuse something that's known to work initially to get us
> off the ground, but then we want to be able to swap in the
> DependenceAnalysis. The idea was exactly as you describe it: to try
> to change the interface of the Memory Dependence Checker to match
> the interface of the DependenceAnalysis pass.
> 
> 
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer. The
> hmmer loop requires memchecks. This is again captured by the same
> LoopAccessAnalysis class.
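For readers unfamiliar with the vectorizer's run-time memchecks: they reduce to pairwise overlap tests on the byte ranges each pointer group accesses. A simplified sketch, using plain integer offsets in place of real pointers:

```python
def ranges_may_overlap(base_a, size_a, base_b, size_b):
    """Conservative run-time check: the accessed byte ranges
    [base, base + size) overlap unless one ends before the other begins."""
    return not (base_a + size_a <= base_b or base_b + size_b <= base_a)

def checks_pass(ranges):
    """If every pair of accessed ranges is disjoint, the transformed
    (distributed/vectorized) loop is safe; otherwise branch to the
    original loop kept around as a fallback."""
    return all(
        not ranges_may_overlap(*ranges[i], *ranges[j])
        for i in range(len(ranges))
        for j in range(i + 1, len(ranges))
    )
```

The quadratic number of pairs is exactly why duplicate checks (the concern below) get expensive.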
> 
> I think this is also reasonable; we just want to make sure that we
> don't end up with double memory checks. I've seen cases in the past
> where the vectorizer has inserted checks that should have been
> eliminated as duplicates with other loop guards; SE's guard-domination
> checking may need to be improved.
> 
> 
> 
> Yes, this was also on my list. (Sorry. I didn’t include everything in
> the original post because it would have been way too long).
> 
> 
> I didn’t know we already have code that tries to deal with this. Can
> you please point me to it?
> 

We don't exactly. JumpThreading could handle the constant-range cases, but that also leaves a lot on the table (and we don't run it after loop vectorization regardless). What we do have is SE's isLoopEntryGuardedByCond, which I thought was used by the loop vectorizer to avoid adding unnecessary guards, but I don't see that now, so I might be wrong.

> 
> - The actual loop distribution is implemented as follows:
> 
> 
> - The list of unsafe memory dependencies is computed for the loop.
> Unsafe
> means that the dependence may be part of a cycle (this is what the
> current
> framework provides).
> - Partitions are created for each set of unsafe dependences.
> - Partitions are created for each of the remaining stores not yet
> encountered.
> The order of the partitions preserves the original order of the
> dependent
> memory accesses.
> - Simple partition merging is performed to minimize the number of new
> loops.
> - Partitions are populated with the other dependent instructions by
> following
> the SSA use-def chains and control dependence edges.
> - Finally, the actual distribution is performed by creating a loop
> for each
> partition. For each partition we clone the loop and remove all the
> instructions that don't belong to the partition.
> - Also, if run-time memory checks are necessary, these are emitted.
> We keep
> an original version of the loop around to branch to if the checks
> fail.
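The steps above can be sketched roughly as follows; the data representation and helper names are invented for illustration (the real pass operates on LLVM IR and the dependence checker's results):

```python
# Toy model of the partitioning steps. Instructions are numbered in
# program order; "unsafe_sets" are groups of instructions tied together
# by possibly-cyclic memory dependences; "stores" are the remaining
# stores; "uses" maps an instruction to the instructions it depends on.

def build_partitions(unsafe_sets, stores, uses):
    # One partition per unsafe-dependence set, one per store not already
    # covered, ordered by the earliest instruction each contains.
    parts = [set(s) for s in unsafe_sets]
    covered = set().union(*parts)
    for s in stores:
        if s not in covered:
            parts.append({s})
    parts.sort(key=min)

    # Simple merging to minimize the number of new loops; here we only
    # merge adjacent singleton partitions, a crude stand-in for the
    # pass's heuristic.
    merged = []
    for p in parts:
        if merged and len(merged[-1]) == 1 and len(p) == 1:
            merged[-1] |= p
        else:
            merged.append(p)

    # Populate each partition with its dependent instructions by
    # walking use-def chains backwards.
    for p in merged:
        work = list(p)
        while work:
            i = work.pop()
            for d in uses.get(i, ()):
                if d not in p:
                    p.add(d)
                    work.append(d)
    return merged
```

Each resulting partition then becomes one clone of the loop with all other instructions removed.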
> 
> This sounds reasonable.
> 
> 
> My plan is to proceed with the following steps:
> 
> 
> - Bring the current functionality to trunk by splitting off smaller
> patches from
> the current patch and completing them. The final commit will enable
> loop
> distribution with a command-line flag or a loop hint.
> 
> Okay, please do.
> 
> 
> - Explore and fine-tune the proper cost model for loop distribution
> to allow
> partial vectorization. This is essentially whether to partition and
> what
> these partitions should be. Currently, instructions are mapped to
> partitions
> using a simple heuristic to create vectorizable partitions. We may
> need to
> interact with the vectorizer to make sure the vectorization will
> actually
> happen and it will be overall profitable.
> 
> I think this sounds reasonable. Splitting to enable vectorization is
> important; one reason to have this process tightly integrated with
> vectorization is so that it can properly integrate with the
> vectorizer's register-pressure checking (we might split to reduce
> register pressure, thus enabling more interleaving, at least when
> doing so does not decrease spatial locality).
> 
> 
> 
> OK, I haven’t thought of splitting due to register pressure. I guess
> this makes sense both in vectorizable and non-vectorizable loops.
> 
> 
> Do you have an example for the part in parentheses? Do you mean that
> spatial locality would be decreased by interleaving?

Not offhand. The interleaving factor is limited by the number of available registers. I recall running across loops that have a lot of phi values, and those limit the number of registers available for the intermediate values required by the interleaving. If the loop can be split, reducing the number of registers needed for phis, that can increase the number of interleavings possible.

Thanks again,
Hal
 
> 
> Independent of vectorization, loop splitting is important to reduce
> register pressure within loops (i.e. loops with too many phis, but
> that are splittable, could be split to prevent intra-iteration
> spilling). Also very important is splitting to reduce the number of
> hardware prefetching streams used by the loop. In every system on
> which I've worked, the hardware prefetchers have a finite set of
> resources to sustain prefetching streams (5-10 per thread, depending
> on the architecture). When a loop requires more streams than this,
> performance will greatly suffer, and splitting it is highly
> profitable. I'd definitely like us to hit these two use cases too.
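As a toy illustration of stream-limited splitting: a greedy packer that keeps each resulting loop's array footprint under the stream budget. It deliberately ignores dependences (so it is only legal when the statements are independent), and one-stream-per-array is a simplification of real prefetcher behavior:

```python
# Hypothetical stream budget per thread; the message cites 5-10
# depending on the architecture.
MAX_STREAMS = 4

def split_by_streams(stmts, max_streams=MAX_STREAMS):
    """stmts: list of (name, set_of_arrays_touched), in program order.
    Greedily pack consecutive statements into loops whose combined
    array footprint stays within the prefetch-stream budget."""
    loops, cur, cur_arrays = [], [], set()
    for name, arrays in stmts:
        if cur and len(cur_arrays | arrays) > max_streams:
            loops.append(cur)          # close the current loop
            cur, cur_arrays = [], set()
        cur.append(name)
        cur_arrays |= arrays
    if cur:
        loops.append(cur)
    return loops
```

A loop touching eight arrays on a four-stream machine would thus be split into two loops of four streams each.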
> 
> 
> 
> Sure. I think that this is essentially what I meant by loop
> distribution to improve spatial locality. Exposing the target’s
> parameters for the HW prefetcher sounds like a nice way to model
> this.
> 
> 
> 
> - Explore other potentials for loop distribution, e.g.:
> - Partial vectorization of loops that can't be if-converted
> - Classic loop distribution to improve spatial locality
> - Compute the Program Dependence Graph rather than the list of unsafe
> memory
> accesses and allow reordering of memory operations
> 
> This would also be quite nice to have.
> 
> 
> 
> - Distribute a loop in order to recognize parts as loop idioms
> 
> Indeed, once you have the partitions, splitting out a memcpy, etc.
> should not be hard; this is not always profitable, however.
> 
> 
> Long term, loop distribution could also become a transformation
> utility
> (Transform/Util). That way, the loop transformation passes could use
> it to
> strip from the loop the parts that inhibit the given optimization.
> 
> This sounds good, but we still may want to schedule the
> transformation itself (late in the pipeline). We don't want to limit
> register-pressure-induced loop splitting, for example, to
> vectorizable loops.
> 
> 
> 
> Sure. I meant that there would still be a “stand-alone” loop
> distribution pass which would be another user of this transformation
> utility.
> 
> 
> Thanks very much for your feedback, Hal!
> 
> 
> Adam
> 
> 
> 
> Thanks again,
> Hal
> 
> 
> Please let me know if you have feedback either on the design or on
> the next
> steps.
> 
> 
> Thanks,
> Adam
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



