[llvm-dev] Generalizing load/store promotion in LICM

Chandler Carruth via llvm-dev llvm-dev at lists.llvm.org
Thu Sep 13 01:51:29 PDT 2018


Haven't had time to dig into this, but wanted to add +Alina Sbirlea
<asbirlea at google.com> to the thread as she has been working on promotion
and other aspects of LICM for a long time here.

On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at philipreames.com>
wrote:

> I'm thinking about making some semi radical changes to load store
> promotion works in LICM, and figured it would be a good idea to get buy in
> before I actually started writing code.  :)
>
> TLDR: legality of sinking stores to exits is hard, can we separate load
> handling into a super aggressive form of PRE, and use predicated stores to
> avoid solving legality question?
>
>
> Background
>
> We've been seeing an interesting class of problems recently that looks
> roughly like this:
>
> for (int = 0; i < N; i++)
>   if (a[i] == 0) // some data dependent check
>     g_count++; // conditional load and store to shared location
>
> The critical detail in this example is that g_count is a global location
> which may be accessed concurrently* by multiple threads.  The challenge
> here is that we don't know whether the store ever executes, and we're not
> allowed to insert a store along any path that didn't originally contain
> them.  Said differently, if all elements in "a" are non-zero, we're not
> allowed to store to g_count.  We do know that the g_count location is
> dereferenceable though.
>
> (*Please, let's avoid the memory model derailment here.  I'm simplifying
> and C++ language rules aren't real useful for my Java language frontend
> anyways.  In practice, all the access are atomic, but unordered, but we'll
> leave that out of discussion otherwise.)
>
> I have two approaches I'm considering here.  These are orthogonal, but I
> suspect we'll want to implement both.
>
>
> Proposal A - Use predicated stores in loop exits
>
> The basic idea is that we don't worry about solving the legality question
> above, and just insert a store which is predicated on a condition which is
> true exactly when the original store ran.  In pseudo code, this looks
> something like:
>
> bool StoreLegal = false;
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     StoreLegal = true;
>   }
> if (StoreLegal) g_count = LocalCount;
>
> There are two obvious concerns here:
>
>    1. The predicated store might be expensive in practice - true for most
>    current architectures.
>    2. We''re introducing an extra boolean phi cycle around the loop.
>
> Talking to a couple of folks offline at the socials over the last few
> months, the former seems to be the main objection.  I think we can control
> this by restricting this transform to when the loop is believed to have a
> high trip count and the conditional path is taken with some frequency.
> Getting feedback on this particular point is one of my main reasons for
> writing this mail.
>
> The second objection can frequently be resolved by finding other
> expressions which evaluate to the same boolean.  (In this case, if
> LocalCount != LocalCountOrig assuming i doesn't overflow.)  We already have
> a framework with SCEV to do these replacements.  Though, from some quick
> testing, it would definitely need strengthening.  However, SCEV can't
> remove the extra phi in all cases, so we have to be okay with the extra phi
> cycle in the general case.  This seems worthwhile to me, but thoughts?
>
>
> Proposal B - Separate load and store handling into distinct transforms
>
> (For folks I've discussed this with before, this part is all new.)
>
> Thinking about the problem, it finally occurred to me that we can
> decompose the original example into two steps: getting the loads out of the
> loop, and sinking the stores out of the loop.  If we can accomplish the
> former, but not the later, we've still made meaningful progress.
>
> So, what'd we'd essentially have is a load only transformation which
> produces this:
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     g_count = LocalCount;
>   }
>
> At this point, we've reduced memory traffic by half, and opened up the
> possibility that other parts of the optimizer can exploit the simplified
> form.  The last point is particular interesting since we generally try to
> canonicalize loads out of loops, and much of the optimizer is tuned for a
> form with as much as possible being loop invariant.  As one example,
> completely by accident, there's some work going on in the LoopVectorizer
> right now to handle such stores to loop invariant addresses during
> vectorization.  Putting the two pieces together would let us fully
> vectorize this loop without needing to sink the stores at all.
>
> In practice, the existing implementation in LICM would cleanly split along
> these lines with little problem.
>
> One way of looking at this load specific transform is as an extreme form
> of PRE (partial redundancy elimination).  Our current PRE implementation
> doesn't handle cases this complicated.
>
> Thoughts?
>
> Philip
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180913/111e4291/attachment.html>


More information about the llvm-dev mailing list