[llvm-dev] Generalizing load/store promotion in LICM
Alina Sbirlea via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 13 14:00:59 PDT 2018
For context, I've been looking into replacing the use of AST
(AliasSetTracker) with MemorySSA in LICM. This works great for
sinking/hoisting but does not apply well for promotion, and one of the
solutions I considered is precisely ripping out the promotion part and
replacing its functionality with a separate PRE pass + possibly store
sinking. FWIW I think that's the right way to go.
I did not get deep enough into working on the solution, but I would gladly
have a detailed discussion to move this forward.
Reading into the details now.
Thanks,
Alina
On Thu, Sep 13, 2018 at 1:43 PM Philip Reames <listmail at philipreames.com>
wrote:
> (minor inline additions)
>
> On 09/13/2018 01:51 AM, Chandler Carruth wrote:
>
> Haven't had time to dig into this, but wanted to add +Alina Sbirlea
> <asbirlea at google.com> to the thread as she has been working on promotion
> and other aspects of LICM for a long time here.
>
> Thanks!
>
> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at philipreames.com>
> wrote:
>
>> I'm thinking about making some semi-radical changes to how load/store
>> promotion works in LICM, and figured it would be a good idea to get buy-in
>> before I actually started writing code. :)
>>
>> TLDR: legality of sinking stores to exits is hard, can we separate load
>> handling into a super aggressive form of PRE, and use predicated stores to
>> avoid solving the legality question?
>>
>>
>> Background
>>
>> We've been seeing an interesting class of problems recently that looks
>> roughly like this:
>>
>> for (int i = 0; i < N; i++)
>>   if (a[i] == 0)  // some data dependent check
>>     g_count++;    // conditional load and store to shared location
>>
>> The critical detail in this example is that g_count is a global location
>> which may be accessed concurrently* by multiple threads. The challenge
>> here is that we don't know whether the store ever executes, and we're not
>> allowed to insert a store along any path that didn't originally contain
>> one. Said differently, if all elements in "a" are non-zero, we're not
>> allowed to store to g_count. We do know that the g_count location is
>> dereferenceable though.
>>
>> (*Please, let's avoid the memory model derailment here. I'm simplifying
>> and C++ language rules aren't real useful for my Java language frontend
>> anyways. In practice, all the accesses are atomic but unordered, but we'll
>> leave that out of discussion otherwise.)
>>
>> I have two approaches I'm considering here. These are orthogonal, but I
>> suspect we'll want to implement both.
>>
>>
>> Proposal A - Use predicated stores in loop exits
>>
>> The basic idea is that we don't worry about solving the legality question
>> above, and just insert a store which is predicated on a condition which is
>> true exactly when the original store ran. In pseudo code, this looks
>> something like:
>>
>> bool StoreLegal = false;
>> int LocalCount = g_count;
>> for (int i = 0; i < N; i++)
>>   if (a[i] == 0) {
>>     LocalCount++;
>>     StoreLegal = true;
>>   }
>> if (StoreLegal) g_count = LocalCount;
>>
>> There are two obvious concerns here:
>>
>> 1. The predicated store might be expensive in practice - true for
>>    most current architectures.
>> 2. We're introducing an extra boolean phi cycle around the loop.
>>
>> Talking to a couple of folks offline at the socials over the last few
>> months, the former seems to be the main objection. I think we can control
>> this by restricting this transform to when the loop is believed to have a
>> high trip count and the conditional path is taken with some frequency.
>> Getting feedback on this particular point is one of my main reasons for
>> writing this mail.
>>
>> The second objection can frequently be resolved by finding other
>> expressions which evaluate to the same boolean. (In this case, if
>> LocalCount != LocalCountOrig assuming i doesn't overflow.) We already have
>> a framework with SCEV to do these replacements. Though, from some quick
>> testing, it would definitely need strengthening. However, SCEV can't
>> remove the extra phi in all cases, so we have to be okay with the extra phi
>> cycle in the general case. This seems worthwhile to me, but thoughts?
>>
>>
>> Proposal B - Separate load and store handling into distinct transforms
>>
>> (For folks I've discussed this with before, this part is all new.)
>>
>> Thinking about the problem, it finally occurred to me that we can
>> decompose the original example into two steps: getting the loads out of the
>> loop, and sinking the stores out of the loop. If we can accomplish the
>> former, but not the latter, we've still made meaningful progress.
>>
>> So, what we'd essentially have is a load-only transformation which
>> produces this:
>> int LocalCount = g_count;
>> for (int i = 0; i < N; i++)
>>   if (a[i] == 0) {
>>     LocalCount++;
>>     g_count = LocalCount;
>>   }
>>
>> At this point, we've reduced memory traffic by half, and opened up the
>> possibility that other parts of the optimizer can exploit the simplified
>> form. The last point is particularly interesting since we generally try to
>> canonicalize loads out of loops, and much of the optimizer is tuned for a
>> form with as much as possible being loop invariant. As one example,
>> completely by accident, there's some work going on in the LoopVectorizer
>> right now to handle such stores to loop invariant addresses during
>> vectorization. Putting the two pieces together would let us fully
>> vectorize this loop without needing to sink the stores at all.
>>
>> In practice, the existing implementation in LICM would cleanly split
>> along these lines with little problem.
>>
>> One way of looking at this load specific transform is as an extreme form
>> of PRE (partial redundancy elimination). Our current PRE implementation
>> doesn't handle cases this complicated.
>>
> It occurred to me later that simply framing the new transform as a
> separate pass (LoopPRE) and using the same AST + SSA construction approach
> would be straightforward. So, if folks think that having an aggressive
> form of load PRE in LICM is going a bit too far, it'd be easy to represent
> as an optional separate pass. I'd still prefer having LICM contain the
> logic though.
>
> Thoughts?
>>
>> Philip
>>
>
>