[llvm-dev] Generalizing load/store promotion in LICM

Alina Sbirlea via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 18 15:10:15 PDT 2018


On Fri, Sep 14, 2018 at 4:39 PM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I agree this approach is possible.  I even have an (ancient, abandoned)
> patch which started down that path.  https://reviews.llvm.org/D7061
>
> I view this as also being useful, not as a reason to avoid doing the
> transform in LICM.  We've already paid the analysis cost, so we might as well use it.
>
> (In case it's not clear, I'm assuming that if we did do a separate LoopPRE
> pass, we'd have to separate out an AliasSetTrackerAnalysis and handle
> invalidation properly through the loop pass pipeline, i.e. not recompute
> the AST.)
>

I was suggesting dropping the use of the AST entirely :). Not going to
happen tomorrow, but at some point...


>
> On 09/14/2018 04:56 AM, John Brawn wrote:
>
> For doing PRE on the load, it looks like there are only two things stopping
> GVN PRE from doing it:
>
> * GVN::PerformLoadPRE doesn't hoist loads that are conditional. Probably
>   this can be overcome with some kind of heuristic that allows it to happen
>   in loops when the blocks where a load would have to be inserted are
>   outside the loop.
>
> * IsFullyAvailableInBlock goes around the loop and double-counts the
>   entry edge into the loop as unavailable. I'm pretty sure this can be
>   easily fixed by marking the block that PerformLoadPRE calls
>   IsFullyAvailableInBlock on as FullyAvailable, so it stops there when
>   following the back edge.
>
> I did a quick experiment (solving the first problem by just commenting out
> that check) and this worked for a simple test case:
>
> int x;
> int arr[32];
>
> void fn(int n) {
>   for (int i = 0; i < n; i++) {
>     if (arr[i]) {
>       x++;
>     }
>   }
> }
>
> So perhaps the load PRE half can be done without a separate pass.
>
>
>
> John
>
>
>
> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf
> Of *Alina Sbirlea via llvm-dev
> *Sent:* 13 September 2018 22:01
> *To:* listmail at philipreames.com
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] Generalizing load/store promotion in LICM
>
>
>
> For context, I've been looking into replacing the use of AST
> (AliasSetTracker) with MemorySSA in LICM. This works great for
> sinking/hoisting but does not apply well for promotion, and one of the
> solutions I considered is precisely ripping out the promotion part and
> replacing its functionality with a separate PRE pass + possibly store
> sinking. FWIW I think that's the right way to go.
>
> I did not get deep enough into working on the solution but I would gladly
> have a detailed discussion to move this forward.
>
>
>
> Reading into detail now.
>
>
>
> Thanks,
>
> Alina
>
>
>
> On Thu, Sep 13, 2018 at 1:43 PM Philip Reames <listmail at philipreames.com>
> wrote:
>
> (minor inline additions)
>
> On 09/13/2018 01:51 AM, Chandler Carruth wrote:
>
> Haven't had time to dig into this, but wanted to add +Alina Sbirlea
> <asbirlea at google.com> to the thread as she has been working on promotion
> and other aspects of LICM for a long time here.
>
> Thanks!
>
> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at philipreames.com>
> wrote:
>
> I'm thinking about making some semi-radical changes to how load/store
> promotion works in LICM, and figured it would be a good idea to get buy-in
> before I actually started writing code.  :)
>
> TLDR: legality of sinking stores to exits is hard; can we separate load
> handling into a super-aggressive form of PRE, and use predicated stores to
> avoid solving the legality question?
>
>
>
> Background
>
> We've been seeing an interesting class of problems recently that looks
> roughly like this:
>
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) // some data dependent check
>     g_count++; // conditional load and store to shared location
>
> The critical detail in this example is that g_count is a global location
> which may be accessed concurrently* by multiple threads.  The challenge
> here is that we don't know whether the store ever executes, and we're not
> allowed to insert a store along any path that didn't originally contain
> one.  Said differently, if all elements in "a" are non-zero, we're not
> allowed to store to g_count.  We do know that the g_count location is
> dereferenceable though.
>
> (*Please, let's avoid the memory model derailment here.  I'm simplifying,
> and C++ language rules aren't really useful for my Java language frontend
> anyway.  In practice, all the accesses are atomic but unordered, though
> we'll leave that out of the discussion otherwise.)
>
> I have two approaches I'm considering here.  These are orthogonal, but I
> suspect we'll want to implement both.
>
>
>
> Proposal A - Use predicated stores in loop exits
>
> The basic idea is that we don't worry about solving the legality question
> above, and just insert a store which is predicated on a condition which is
> true exactly when the original store ran.  In pseudo code, this looks
> something like:
>
> bool StoreLegal = false;
> int LocalCount = g_count;
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     StoreLegal = true;
>   }
> if (StoreLegal) g_count = LocalCount;
>
> There are two obvious concerns here:
>
>    1. The predicated store might be expensive in practice - true for most
>    current architectures.
>    2. We're introducing an extra boolean phi cycle around the loop.
>
> Talking to a couple of folks offline at the socials over the last few
> months, the former seems to be the main objection.  I think we can control
> this by restricting this transform to when the loop is believed to have a
> high trip count and the conditional path is taken with some frequency.
> Getting feedback on this particular point is one of my main reasons for
> writing this mail.
>
> The second objection can frequently be resolved by finding other
> expressions which evaluate to the same boolean.  (In this case, if
> LocalCount != LocalCountOrig assuming i doesn't overflow.)  We already have
> a framework with SCEV to do these replacements.  Though, from some quick
> testing, it would definitely need strengthening.  However, SCEV can't
> remove the extra phi in all cases, so we have to be okay with the extra phi
> cycle in the general case.  This seems worthwhile to me, but thoughts?
>
>
>
> Proposal B - Separate load and store handling into distinct transforms
>
> (For folks I've discussed this with before, this part is all new.)
>
> Thinking about the problem, it finally occurred to me that we can
> decompose the original example into two steps: getting the loads out of the
> loop, and sinking the stores out of the loop.  If we can accomplish the
> former, but not the latter, we've still made meaningful progress.
>
> So, what we'd essentially have is a load-only transformation which
> produces this:
>
> int LocalCount = g_count;
> for (int i = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     g_count = LocalCount;
>   }
>
> At this point, we've reduced memory traffic by half, and opened up the
> possibility that other parts of the optimizer can exploit the simplified
> form.  The last point is particularly interesting, since we generally try
> to canonicalize loads out of loops, and much of the optimizer is tuned for
> a form with as much as possible being loop invariant.  As one example,
> completely by accident, there's some work going on in the LoopVectorizer
> right now to handle such stores to loop invariant addresses during
> vectorization.  Putting the two pieces together would let us fully
> vectorize this loop without needing to sink the stores at all.
>
> In practice, the existing implementation in LICM would cleanly split along
> these lines with little problem.
>
> One way of looking at this load specific transform is as an extreme form
> of PRE (partial redundancy elimination).  Our current PRE implementation
> doesn't handle cases this complicated.
>
> It occurred to me later that simply framing the new transform as a
> separate pass (LoopPRE) and using the same AST + SSA construction approach
> would be straightforward.  So, if folks think that having an aggressive
> form of load PRE in LICM is going a bit too far, it'd be easy to represent
> it as an optional separate pass.  I'd still prefer having LICM contain the
> logic, though.
>
> Thoughts?
>
> Philip
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

