[PATCH] D31239: [WIP] Add Caching of Known Bits in InstCombine

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 29 17:45:45 PDT 2017


On Sat, Mar 25, 2017 at 11:22 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> On 03/25/2017 10:57 AM, Daniel Berlin wrote:
>
>
>
> On Sat, Mar 25, 2017 at 5:28 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 03/24/2017 06:00 PM, Daniel Berlin wrote:
>>
>>
>>
>> On Fri, Mar 24, 2017 at 3:44 PM, Daniel Berlin <dberlin at dberlin.org>
>> wrote:
>>
>>> In which I repeat my claim that trying to keep caches up to date is
>>> significantly harder to get right and verify than incremental recomputation
>>> :)
>>>
>>
>> and to be super clear here, the practical difference between the two is
>> that incremental computation is really "build an analysis that can be run
>> at any point, but is a full analysis that computes all answers rather than
>> computing on demand. The speed at which the analysis runs depends on how
>> much data has changed since last time.".
>>
>>
>> How exactly do you propose this works in this case (given that the
>> recomputation cost is proportional to the amount the IR has changed)?
>>
>
> As I mentioned, first I would actually measure whether we really need to
> be recomputing that often *at all*.
> There is no point in either incrementally recomputing *or* computing on
> demand if it is not actually buying us significant generated code
> performance.
> The simplest thing to verify correct is the one that does not need to be
> updated at all :)
>
> Second, yes, the recomputation cost is proportional in either case.
>
>
>> It sounds like caching with the default behavior on RAUW flipped.
>>
> In this case, it would be similar.
> Your caching, right now, has the following properties that could probably
> be changed fairly easily:
> - It is not a full analysis; it must be handled explicitly.
> - It has no verifier (you would think you could write one by comparing every
>   value against a freshly computed, uncached result, but I think you will
>   discover that does not work either; see below).
> - It has no printer.
> The above are all things you suggested might be fixed anyway, so I'm going
> to leave them out, except for verification.
>
> Past that, the other explicit difference here is the approach to
> computation.
>
> On-demand cached computation here starts at "the thing I am missing data
> for" and goes walking backwards through the IR to try to find the answer.
> It does not pay attention to computation order, etc.
> You start at the leaves, you end at the root.
> That means it often either does significant duplicated work (even if that
> work is only wasted cache lookups and visits) or provably does not reach a
> maximal fixpoint when making use of the cache.
>
> In this case, it is both ;)
>
> For example, for an operator, it computes the known bits for the LHS and
> RHS recursively.
> Without caching, it computes them repeatedly, visiting from use->def.
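>
> Purely as an illustrative sketch (not the actual ValueTracking code;
> combineForOpcode and the depth limit here are made up), the on-demand
> shape is roughly:
>
>   // Walk use->def, recursing into the operands every time we are asked.
>   KnownBits computeOnDemand(const Value *V, unsigned Depth) {
>     KnownBits Known(V->getType()->getScalarSizeInBits());
>     if (Depth > MaxDepth)   // give up: report "everything unknown"
>       return Known;
>     if (auto *BO = dyn_cast<BinaryOperator>(V)) {
>       KnownBits L = computeOnDemand(BO->getOperand(0), Depth + 1);
>       KnownBits R = computeOnDemand(BO->getOperand(1), Depth + 1);
>       return combineForOpcode(BO->getOpcode(), L, R); // e.g. AND: union of zeros
>     }
>     // ... constants, phis (with the extra recursion cap), etc.
>     return Known;
>   }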
>
> Incremental recomputation would compute known-bits as a forwards
> propagation problem over the function.
> You start at the root, you end at the leaves.
> When you visit an operator, the LHS and RHS would already have been
> visited (if they are not part of a cycle), and thus you compute it *once*.
> This requires ordering the computation and propagation properly, mind you.
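>
> Again just as a sketch of the visitation order, not an exact API (KnownMap
> and computeFromOperands are placeholders), the forwards shape is more like:
>
>   // llvm/ADT/PostOrderIterator.h: visit blocks in reverse post-order and
>   // instructions in block order, so (outside of cycles) operands are
>   // processed before their users and each value is computed exactly once.
>   ReversePostOrderTraversal<Function *> RPOT(&F);
>   for (BasicBlock *BB : RPOT)
>     for (Instruction &I : *BB)
>       KnownMap[&I] = computeFromOperands(I, KnownMap);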
>
> Second, the caching version does not reach a maximal fixpoint:
>         // Recurse, but cap the recursion to one level, because we don't
>         // want to waste time spinning around in loops.
> This is because it is working backwards, on demand, and so phi computation
> sometimes cannot reach a fixpoint in a sane amount of time, so it gets
> cut off.
> Solving the forwards problem properly, in the correct order, will *always*
> give the same answer from the same starting point, given the same set of
> invalidations.
>
> Note that this in turn affects verification.  Because the on-demand
> version has cut-offs that effectively depend on visitation depth, the
> cached version will get different answers depending on how much is cached,
> because it will need to go less deep to get a specific answer.
> This depends on visitation order.
> This, in turn, makes it really hard to write a verifier :)  You can't just
> compare cached vs. non-cached.  The cached version may get better or worse
> answers depending on visitation order (incremental recomputation never does
> in this case, assuming you invalidate properly in both cases).
>
> I'm pretty sure *you* get this, but for those following along:
> Imagine you have a phi
> phi(a, b, c, d)
>
> Currently, while walking this phi, it will only recurse a certain depth to
> find an answer.  Let's pretend it's depth 1.
> Let's say that, given the initial order in which you ask for known bits
> (you ask about the phi before a, b, c, and d), they are all too deep for it
> to find an answer: it can't visit deep enough to compute a, b, c, and d.
>
> Thus, if you start from *nothing*, the answer you get for the phi is:
> "unknown".
> Now, computing the phi is separate from computing the answer for a, b, c,
> d.
> So next, you visit a, b, c, and d, and compute their answers.
> It can successfully do so (because the depth limit applies starting from
> the phi, and even if it didn't, when queried directly these values get one
> more level of depth budget than they had when reached through the phi).
>
> Now, if you invalidate the phi in the cache (assume it's just an innocent
> victim and the value did not really change, it just "may have"), and ask it
> to compute the answer:
>
> lo and behold, now it has the answers for a, b, c, and d, and can give you
> an answer for the phi.
>
> So the cached version gave us a better answer.
>
> The real reason is that it got closer to the maximal fixpoint than the
> original version did.
>
> But still, much harder to verify because the answer really is allowed to
> vary depending on the order in which you ask it for answers.
>
> The forwards incremental computation has exactly zero of these problems.
> It always computes the same answer.
> It only varies the time taken to do so.
>
>
>
>> Instead of assuming by default that the replacement instruction has the
>> same known bits as the original, we assume that it does not (and then
>> recursively clear out all users).
>>
>
> If they were computed in the same direction, with the same propagation
> strategy, yes, I would agree.
>
>
> In general, I agree with all of this. To be clear, I posted this patch so
> that we could make some measurements of InstCombine with the known-bits
> computation cost stabilized.
>
Of course. I think the work is very valuable. I was more opposed to the
immediate response I saw from some others of "let's just take this and push
it in because it's our only real option".



> Given that we compute known bits on essentially every relevant
> instruction, doing this ahead of time makes perfect sense (i.e. aside from
> all of the disadvantages, we likely gain very little from the laziness).
>



>
> I do want to explicitly clarify the following point, however. In the
> analysis as you imagine it, what exactly do we do when we replace an IR
> value (or update an instruction's flags)?
>
In the current world, your only option is to have the optimization pass
notify the analysis, explicitly, through a call, that a piece of IR "X" got
invalidated.
We could build something else, but I'd start there anyway, because the
exact mechanism through which the analysis learns of invalidation doesn't
change what it does internally.

Once it gets that notification, it has two options:
1. It throws it on a list, and the next time you ask it for a result, it
does the recomputation (which presents the same interface as on-demand, but
isn't really the same internally)
2. It does the recomputation immediately.

Both are valid options.

It's trickier if you *also* invalidate the CFG, but that's handleable too.
i"m going to assume we aren't invalidating the CFG here for a second.

In practice, there is not a ton of difference between the two internally.
In *either* case, you shove the names of the pieces of IR that got
invalidated onto a list, grow the list to the bounds of what needs to be
recomputed, and do it.
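
As a sketch of that shape (KnownBitsAnalysis, Dirty, and Results are names
I am making up here, not an existing interface):

  // Option 1 from above: queue the invalidation, recompute lazily in bulk.
  void KnownBitsAnalysis::notifyChanged(Value *V) {
    Dirty.insert(V); // remember what the pass told us "may have changed"
  }

  void KnownBitsAnalysis::flush() {
    SmallVector<Value *, 16> Worklist(Dirty.begin(), Dirty.end());
    while (!Worklist.empty()) {
      Value *V = Worklist.pop_back_val();
      if (!Results.erase(V))
        continue;                  // already invalidated; don't revisit
      for (User *U : V->users())   // grow to everything that may depend on V
        Worklist.push_back(U);
    }
    Dirty.clear();
    // Now recompute, forwards and in order, only the values with no result.
  }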

For an SSA-based analysis that is single-pass, you can just invalidate the
users.
For an SSA-based analysis that is optimistic/iterative, the truly easy
solution is to convert it to use SCCs of the def-use graph, which should
be amazingly trivial: if you could get a maximal result based on def-use
propagation, you will get a maximal result by iterating SCCs.

1. Form SCCs of the SSA graph; Tarjan's algorithm will give you a topological
ordering.
For each SCC in topo order, iterate your analysis on the SCC's values until
it stops changing.
This is guaranteed to give a maximal result, and the condensed graph of
SCCs is acyclic.

2. When something is invalidated, you invalidate the results for its SCC and
reiterate that SCC.
If the result ends up the same as it was before, you are done.
If it changes, you invalidate the SCCs using values from the current SCC
(using the topo info), and recompute them.
Keep going along each path of using SCCs in that order until the result
stops changing on each path.
When you run out of paths, you are done.
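
A rough sketch of that shape (SCCsInTopoOrder and updateKnownBits are
placeholders; the SCCs and their topological order would come from a Tarjan
walk over the def-use graph):

  // Initial solve: the condensed graph of SCCs is acyclic, so walking SCCs
  // in topo order and iterating each to a fixpoint gives a maximal result.
  for (ArrayRef<Value *> SCC : SCCsInTopoOrder) {
    bool Changed = true;
    while (Changed) {
      Changed = false;
      for (Value *V : SCC)
        Changed |= updateKnownBits(V); // true if V's known bits improved
    }
  }
  // On invalidation: clear and re-solve the SCC containing the changed value;
  // if its results differ from before, do the same for the SCCs that use it,
  // and keep going until the results stop changing along each path.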


This is, of course, not the only way, and not even guaranteed to be the
fastest; it just makes the analysis easy to do and verify, and is usually
quite fast.

Note: if your value change modified the SCC structure, that too can be
updated fairly easily.

> Thanks again,
> Hal
>
>
>
>
>>
>>  -Hal
>>
>> The compile time hit depends on how often you want answers.  Usually
>> people try to say "I want them always up to date" (often without any
>> quantification of cost or other trade-offs), which leads to trying to do it
>> on-demand.
>>
>> Caching is really  "build something that is always computed on demand.
>> Try to make it faster by storing the answers it gives somewhere, and hoping
>> that the rest of the algorithm retains the ability to give correct answers
>> on demand when you remove some of the stored answers".
>>
>> One can obviously argue that incremental recomputation is a form of
>> caching, so i'm trying to be concrete about my definitions.
>>
>> Usually, caching additionally requires that clients be able to correctly
>> predict and remove the set of answers necessary for correctness (i.e., know
>> enough about how the algorithm works), but this isn't strictly the case if
>> it is designed well.
>>
>> Caching also often ends up with degraded answers over time, while most
>> full analyses can be made to not do so.  See, for example, memdep, which
>> has unresolvable cache invalidation issues.  This means, for example, that
>> opt -gvn -gvn gives different answers than opt -gvn | opt -gvn.
>>
>> Both approaches obviously require information about what has changed in
>> the IR.
>>
>>> I additionally claim that we should actually quantify the performance
>>> benefits of doing either over "recomputing once or twice during the
>>> pipeline" before going down this route.
>>>
>>> We are *already* trading off compile time performance vs generated code
>>> performance.
>>>
>>> All other things being equal, we should trade it off in a way that makes
>>> maintenance of code and verification of correctness as easy as possible.
>>>
>>> Asserting that a good tradeoff to reduce compile time is "cache" instead
>>> of "stop doing it on demand" (or incrementally recompute), without any data
>>> other than compile time performance seems ... not particularly scientific.
>>>
>>> It's certainly true that caching is often posited as "easier", but I
>>> think a trip through Bugzilla would put this idea to rest.
>>>
>>>
>>>
>>> On Fri, Mar 24, 2017 at 3:33 PM, Craig Topper via Phabricator <
>>> reviews at reviews.llvm.org> wrote:
>>>
>>>> craig.topper added a comment.
>>>>
>>>> I'm seeing more problems than just nsw/nuw flags here.
>>>>
>>>> The sext.ll test is failing because SimplifyDemandedInstructionBits
>>>> determined that this:
>>>>
>>>>   %and = and i32 %x, 16
>>>>   shl i32 %and, 27
>>>>
>>>> simplified to just the shl, because we were only demanding the MSB of
>>>> the shift. This occurred after we had cached the known bits for the shl
>>>> as having 31 LSBs as 0. But without the "and" in there, we can no longer
>>>> guarantee the lower bits of the shift result are 0.
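>>>>
>>>> (To spell that out: %x & 16 can only have bit 4 set, so after the shift
>>>> left by 27 the result can only have bit 31 set, i.e. the 31 low bits are
>>>> known zero. Once the and is dropped, %x << 27 only guarantees the low 27
>>>> bits are zero, so the cached fact about bits 27..30 no longer holds.)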
>>>>
>>>> I also got a failure on shift.ll not reported here. This was caused by
>>>> getShiftedValue rewriting operands and changing constants of other shifts.
>>>> This effectively shifts the known bits, and thus the cache needs to be
>>>> invalidated.
>>>>
>>>>
>>>> https://reviews.llvm.org/D31239
>>>>
>>>>
>>>>
>>>>
>>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>