[llvm-dev] LICM as canonical form

Fri Dec 3 15:27:43 PST 2021

Later today, I'm going to be reverting D87551.  I first raised serious 
concern on said review back in Oct, but this is a bit of an unusual case 
because the change landed roughly a year before that.

This patch introduced a profile driven heuristic to selectively disable 
hoisting of instructions out of loops.  By doing so, it changes a long 
standing design element without broad consensus following discussion on 
llvm-dev.  However, this email isn't really about the revert per se.

In the course of the discussion leading to this point, I realized we 
didn't really have a cite-able resource describing the historical 
design.  This email is an attempt to provide that, and to highlight some 
of the issues which need addressed if we do decide we want to change it.

*LICM as canonical form*

We have for many years treated hoisting instructions out of loops as a 
canonical form.  That is, hoisting is not done because it is profitable 
(though it often is), but is instead done so that other parts of the 
optimizer can rely on it.

We assume that an unprofitable hoist will be undone. Historically, we 
have generally assumed this to be done in the backend, but more 
recently, LoopSink has also been added towards the end of the IR 
pipeline with the same goal.

*Why does this matter?*

Other transforms depend on us having hoisted instructions out of loops 
for effectiveness.  The largest source of such assumptions is that SCEV 
is unable to compute trip counts for any exit condition involving a loop 
varying load.  Almost all of our loop transformations depend on SCEVs 
trip count logic, so failing to hoist an otherwise hoistable load is a 
severe pessimization.

For illustration purposes, consider this toy example:

   for (int i = 0; ; i++) {
     sum += a[i] + *b;
     length = a.length;
     if (i >= length) break;
   }

This example involves a typical for-loop for which the exit test depends 
on a loop varying load.

Here's a couple examples:

 1. In unrolling, the form above is not unrollable.  The trip count is
    unknowable.  We might be able to use profile information to do a
    bounded full unroll if this loop is short running, but all other
    forms of unrolling (exact full, partial, and runtime) will be
    impossible.
 2. In the vectorizer, we will be unable to establish a trip count, and
    thus will not vectorize.  Additionally, even if we can compute a
    trip count, the cost model handling for uniformity depends on hoisting.

Other impacts worth noting

 1. In loop idiom recognize, we will fail to recognize most counted
    idioms (e.g. popcount, cttz).  Additionally, things like memset
    recognition will not happen if the value being stored was hoistable,
    but not hoisted.
 2. Our ability to analyze dominating conditions (e.g. cvp,
    valuetracking, SCEV's isKnownPredicateAt) will all be crippled by
    the inability to recognize values are loop invariant.  When the RHS
    of a comparison is a potentially different value every time it runs,
    it really limits our ability to derive useful knowledge from that
    comparison or cross correlate comparisons.

*But what about an unprofitable hoist?*

There are examples where hoisting is not profitable.  Here's one such 
example:

   for (int i = 0; i < N; i++) {
     if (dynamically_always_taken) continue;
     sum += a[i];
     length = a.length;
     if (i >= length) break;
   }

Our general posture has been that we will perform hoisting in the middle 
end, and then undo that hoisting if needed later in the pass pipeline.  
The basic reason for that is that it is nearly impossible to distinguish 
profitable from unprofitable cases because the profitability of the 
transform depends too heavily on which following transforms might run.

Here's a small example which might at first seem unprofitable - inspired 
by the patch being reverted - but where hoisting is in fact the far more 
profitable outcome.

   for (int i = 0; i < N; i++) {
     i8* addr = a;
     if (invariant_cond_usually_false) {
        // very, very rare block e.g. 1 in 100 million
        addr = a + 1;
     }
     *addr = 0
   }

Subtly, this example should be profitable to hoist even if the rare 
condition is not invariant.  We still know this loop writes to at most 
two memory locations.  While we might not exploit that fact today, an 
extended form of load-store promotion could do so. If we don't hoist the 
addressing expression under the rare branch, we can not (in general) 
determine that at most two locations are written.

I will note that LoopSink appears to be a bit restricted in practice.  
Someone with unprofitable examples could reasonably push this much further.

*How would we change this?*

I want to be very explicit about saying this design is only one 
reasonable design.  It would also be entirely reasonable to build a 
design around a profit driven LICM.  That's simply not what we have 
today.  The remainder of this section is about expanding on the work 
which would need to be undertaken to make such a change.

First, we would need a clear set of examples where LICM was truly 
unprofitable.  These examples would need to be publicly accessible.  
They would also need to be fairly minimal.  In particular, there must 
not be other obvious optimizations which if implemented makes the 
hoisting profitable after all.

Second, we would need a proposal to llvm-dev which directly engages with 
the fact that SCEV (and thus most of our loop passes) depends on having 
loads hoisted for analysis quality.  We could build a mechanism in SCEV 
to model possible load hoisting.  There are some tricky bits in doing 
so, but it should be possible in theory.

The main problem with modeling possible hoists in SCEV is the need to 
query both memory analysis and fault legality efficiently.  Figuring out 
how to make that available for all users of SCEV without introducing 
nasty invalidation bugs or degenerate compile times is a hard design 
problem, but might be feasible.  (I think that MemorySSA gives an 
interesting building block here, but have not deeply considered this.)

There's also an API design problem in making sure that analysis results 
can't be consumed without committing to the hoisting in the IR.  A 
transform which e.g. assumes a trip count without hoisting the relevant 
load would be subtly incorrect.

Third, a consensus must be built that the resulting additional 
complexity for the mechanism build to address the previous point is 
worthwhile for the project as a whole.  This will be a judgement call 
and would depend heavily on the solution chosen for the previous point.

Finally, to be explicit, I am using the SCEV use case by way of an 
example only.  There may be other ways that we depend on hoisting as a 
canonical form that I did not happen to think of when writing this 
email.  The burden is on the person or person proposing a change to 
identify any other dependencies, and to convince the community they have 
done so.  Discussion of a testing strategy to find those dependencies 
should be a first class concern in any proposal.

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211203/4a41d395/attachment.html>