[llvm-dev] Generalizing load/store promotion in LICM
Philip Reames via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 13 13:42:53 PDT 2018
(minor inline additions)
On 09/13/2018 01:51 AM, Chandler Carruth wrote:
> Haven't had time to dig into this, but wanted to add +Alina Sbirlea
> <mailto:asbirlea at google.com> to the thread as she has been working on
> promotion and other aspects of LICM for a long time here.
Thanks!
> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
> I'm thinking about making some semi-radical changes to how load/store
> promotion works in LICM, and figured it would be a good idea to
> get buy-in before I actually start writing code. :)
>
> TLDR: legality of sinking stores to exits is hard. Can we separate
> load handling into a super-aggressive form of PRE, and use
> predicated stores to avoid having to solve the legality question?
>
>
> Background
>
> We've been seeing an interesting class of problems recently that
> looks roughly like this:
>
> for (int i = 0; i < N; i++)
>   if (a[i] == 0)   // some data-dependent check
>     g_count++;     // conditional load and store to shared location
>
> The critical detail in this example is that g_count is a global
> location which may be accessed concurrently* by multiple threads.
> The challenge here is that we don't know whether the store ever
> executes, and we're not allowed to insert a store along any path
> that didn't originally contain one. Said differently, if all
> elements in "a" are non-zero, we're not allowed to store to
> g_count. We do know that the g_count location is dereferenceable
> though.
>
> (*Please, let's avoid the memory model derailment here. I'm
> simplifying, and C++ language rules aren't really useful for my Java
> language frontend anyway. In practice, all the accesses are
> atomic but unordered, but we'll leave that out of the discussion
> otherwise.)
>
> I have two approaches I'm considering here. These are orthogonal,
> but I suspect we'll want to implement both.
>
>
> Proposal A - Use predicated stores in loop exits
>
> The basic idea is that we don't worry about solving the legality
> question above, and just insert a store which is predicated on a
> condition which is true exactly when the original store ran. In
> pseudo code, this looks something like:
>
> bool StoreLegal = false;
> int LocalCount = g_count;
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     StoreLegal = true;
>   }
> if (StoreLegal) g_count = LocalCount;
>
> There are two obvious concerns here:
>
> 1. The predicated store might be expensive in practice - true for
> most current architectures.
> 2. We're introducing an extra boolean phi cycle around the loop.
>
> Talking to a couple of folks offline at the socials over the last
> few months, the former seems to be the main objection. I think we
> can control this by restricting this transform to when the loop is
> believed to have a high trip count and the conditional path is
> taken with some frequency. Getting feedback on this particular
> point is one of my main reasons for writing this mail.
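To make that last point concrete, the sort of profile-based gate I have in
mind would look roughly like the sketch below. The function name and the 4x
threshold are purely hypothetical placeholders for illustration, not existing
LICM code or tuned values.

  #include <cstdint>
  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  using namespace llvm;

  // Hypothetical profitability gate for the predicated-store rewrite: only
  // do it when the conditional store block is hot relative to the loop
  // preheader, so the extra exit store and flag phi are amortized.
  static bool worthPredicatingStore(const Loop *L, const BasicBlock *StoreBB,
                                    const BlockFrequencyInfo &BFI) {
    const BasicBlock *Preheader = L->getLoopPreheader();
    if (!Preheader)
      return false;
    uint64_t EntryFreq = BFI.getBlockFreq(Preheader).getFrequency();
    uint64_t StoreFreq = BFI.getBlockFreq(StoreBB).getFrequency();
    // Placeholder threshold: require the store to run several times per
    // loop entry, i.e. a high trip count and a condition that fires with
    // some frequency.
    return EntryFreq > 0 && StoreFreq >= 4 * EntryFreq;
  }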
>
> The second objection can frequently be resolved by finding other
> expressions which evaluate to the same boolean. (In this case,
> LocalCount != LocalCountOrig, assuming i doesn't overflow.) We
> already have a framework with SCEV to do these replacements.
> Though, from some quick testing, it would definitely need
> strengthening. However, SCEV can't remove the extra phi in all
> cases, so we have to be okay with the extra phi cycle in the
> general case. This seems worthwhile to me, but thoughts?
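For concreteness, the flag-free form I'd hope the SCEV-based replacement
could get us to for the running example is sketched below (hand-written, and
only valid under the no-overflow assumption above):

  int g_count;                      // shared counter from the example

  void count_zeros(const int *a, int N) {
    int LocalCountOrig = g_count;   // single load hoisted to the preheader
    int LocalCount = LocalCountOrig;
    for (int i = 0; i < N; i++)
      if (a[i] == 0)
        LocalCount++;
    // True exactly when the conditional increment ran at least once, so it
    // can stand in for the StoreLegal flag.
    if (LocalCount != LocalCountOrig)
      g_count = LocalCount;
  }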
>
>
> Proposal B - Separate load and store handling into distinct transforms
>
> (For folks I've discussed this with before, this part is all new.)
>
> Thinking about the problem, it finally occurred to me that we can
> decompose the original example into two steps: getting the loads
> out of the loop, and sinking the stores out of the loop. If we
> can accomplish the former, but not the latter, we've still made
> meaningful progress.
>
> So, what we'd essentially have is a load-only transformation
> which produces this:
> int LocalCount = g_count;
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     g_count = LocalCount;
>   }
>
> At this point, we've reduced memory traffic by half, and opened up
> the possibility that other parts of the optimizer can exploit the
> simplified form. The last point is particular interesting since
> we generally try to canonicalize loads out of loops, and much of
> the optimizer is tuned for a form with as much as possible being
> loop invariant. As one example, completely by accident, there's
> some work going on in the LoopVectorizer right now to handle such
> stores to loop invariant addresses during vectorization. Putting
> the two pieces together would let us fully vectorize this loop
> without needing to sink the stores at all.
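To spell out why that combination vectorizes so nicely: after the load-only
transform plus the predicated exit store, the loop body contains no memory
writes at all, just an integer add reduction plus a "did we ever increment"
condition. A conceptual, hand-written sketch with VF = 4 (illustration only,
not LoopVectorizer output):

  int g_count;

  void count_zeros_vec(const int *a, int N) {
    int LocalCount = g_count;        // hoisted load (proposal B)
    int Lane[4] = {0, 0, 0, 0};      // per-lane partial counts
    int i = 0;
    for (; i + 4 <= N; i += 4)       // vector body
      for (int l = 0; l < 4; l++)    // stands in for one widened compare+add
        Lane[l] += (a[i + l] == 0);
    int Added = Lane[0] + Lane[1] + Lane[2] + Lane[3];  // horizontal reduce
    for (; i < N; i++)               // scalar epilogue
      Added += (a[i] == 0);
    if (Added != 0)                  // plays the role of StoreLegal
      g_count = LocalCount + Added;  // single predicated store in the exit
  }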
>
> In practice, the existing implementation in LICM would cleanly
> split along these lines with little problem.
>
> One way of looking at this load-specific transform is as an
> extreme form of PRE (partial redundancy elimination). Our current
> PRE implementation doesn't handle cases this complicated.
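For contrast, the kind of case our existing load PRE (in GVN) does handle is
one where the load is already available along some path and just needs a copy
inserted on the other edge, roughly like this (a simplified illustration, not
a claim about exactly what GVN emits):

  int f(const int *p, int c) {
    int x = 0;
    if (c)
      x = *p;         // load available along this edge
    return x + *p;    // load PRE inserts a load on the other edge plus a phi
  }

The loop case above is harder: there is no existing load outside the loop to
make the in-loop load redundant against, and the in-loop stores have to feed
their values back into the SSA web.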
>
It occurred to me later that simply framing the new transform as a separate
pass (LoopPRE) and using the same AST (AliasSetTracker) + SSA construction
approach would be straightforward. So, if folks think that having an
aggressive form of load PRE in LICM is going a bit too far, it'd be easy
to represent as an optional separate pass. I'd still prefer having LICM
contain the logic though.
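As a very rough sketch of what the core of that load-only rewrite could look
like (a hypothetical helper, simplified by assuming at most one access to the
promoted location per basic block so SSAUpdater's block-level availability is
enough):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Support/Casting.h"
  #include "llvm/Transforms/Utils/SSAUpdater.h"
  using namespace llvm;

  // Stores stay in place (that's the point of proposal B), but their values
  // feed an SSA web so every in-loop load of the promoted location can be
  // replaced. 'Accesses' are the in-loop loads/stores of that location;
  // 'PreheaderLoad' is the load we inserted in the preheader.
  static void rewriteLoadsOnly(ArrayRef<Instruction *> Accesses,
                               LoadInst *PreheaderLoad) {
    SSAUpdater SSA;
    SSA.Initialize(PreheaderLoad->getType(), PreheaderLoad->getName());
    // The preheader load and every in-loop store define the value that is
    // live out of their blocks.
    SSA.AddAvailableValue(PreheaderLoad->getParent(), PreheaderLoad);
    for (Instruction *I : Accesses)
      if (auto *SI = dyn_cast<StoreInst>(I))
        SSA.AddAvailableValue(SI->getParent(), SI->getValueOperand());
    // Replace each in-loop load with the value reaching its block;
    // SSAUpdater materializes any phis that are needed.
    for (Instruction *I : Accesses)
      if (auto *LI = dyn_cast<LoadInst>(I)) {
        LI->replaceAllUsesWith(SSA.GetValueInMiddleOfBlock(LI->getParent()));
        LI->eraseFromParent();
      }
  }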
>
> Thoughts?
>
> Philip
>