[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model

Fri Aug 4 11:41:26 PDT 2017

On 8/4/2017 2:06 PM, Daniel Berlin wrote:
> A few notes:
> I'm a bit surprised IPO copy/constant propagation doesn't get this 
> case, but i didn't look if the lattice supports variables.
> In particular, in your example, given no other call sites, it should 
> eliminate the dead code.
> (In a real program, it may require cloning).

In the actual program (SPEC2017/gcc, ironically), there are multiple 
calls to fn2 and only one of them has the property that the 1st and 2nd 
argument are the same (as is shown in my pseudo code). Internally, we 
have another developer, Matt Simpson, working on a function 
specialization patch that might be of value here. Specifically, you 
could clone fn2 based on the fact that a_ptr == dst_ptr and then 
simplify a great deal of the function.  However, that patch is still a WIP.

> GCC will do IPA-CP/const-prop with cloning, and i'm wildly curious if 
> new GCC's catch this case for you at higher optimization levels?

GCC does inline fn2 into fn1 in this particular case, but I'm not 
exactly sure how GCC accomplishes this.  I'm guessing GCC is just more 
aggressive with its inlining (fn2 is also marked with the inline 
keyword, which I assume GCC uses as a hint).  I'm speculating here and 
I've never worked on GCC, so unfortunately I have little to go on.

> If so, it may be worth not looking at this as an inlining problem, but 
> as an area we need IPO infrastructure improvement

Because of the multiple callsites with varying characteristics I'm not 
sure this can be solved in this way.

>
> Otherwise,  a couple things:
> Approximate dominators (for example, semi-dominators) can be computed 
> fast (a DFS walk of the CFG with no real additional computation)
> Except in strange CFGs that jump around a lot, they are the dominators.
>
> More importantly, the dominator is either the sdom or a proper 
> ancestor of the sdom.
>
> The practical impact of this is that if you use them as if they were 
> dominators, the set of conditions you discover will not be "too 
> wrong".  Occasionally wrong, but mostly not.
>
> My guess is the cost of doing approximate dominators is ~50-60% of the 
> cost of doing dominators.  Nowadays, about half the time was in the 
> DFS walk, the other half in the computation. At best, it would be 2-3x 
> faster.
> I've no idea if this changes whether we'd want dominators, approximate 
> dominators, or stick with nothing.

Right, this is kinda one of the bigger questions I'm trying to figure 
out.  My proposed solution doesn't use the dominator tree in order to 
minimize the impact on compile-time.  However, I'd guess the ROI is 
going to be much smaller because of the limited scope. On the other end 
of the spectrum I'd fear the full dominator tree would be too 
computationally expensive (but of course some of that could be mitigated 
by the ability to do incremental updates to the dominator tree).

>
> If you have some sort of dominatorish tree could then just use 
> earlycse's method of dominating condition finding:
> Process in "dom tree" top-down order, push the equivalences you see, 
> pop the relevant ones when you exit the relevant dom tree scope.
>
> In practice, you'd only check comparisons against the hash table.

Humm.. I'll have to think about it for a bit.  I'm thinking this might 
be a good compromise for my needs.

>
> The other option is PredicateInfo, but it requires dominators and 
> modifies the IR.
>
> My guess is this is undesired/too heavyweight for inline cost 
> analysis, however the basic principle on how it renames things could 
> also be applied without IR changing for this specific case.  Unlike 
> the EarlyCSE method, which is O(all instructons) PredicateInfo is 
> O(branches + number of uses of variables affected by conditions) 
>  Without going into futher details, if all you care about is "for each 
> condition, give me the set of possibly affected variables" (so you can 
> see if they may simplify), we could do that very very quickly (as fast 
> as we can sort a vector). But it does require dominators.

For my particular problem, I think PredicateInfo would be sufficient 
IIUYC.  But as you suggest, I'm thinking people aren't going to be fond 
of using the full dominators.

Lots of great feedback.  Thanks, Danny.

>
>
>
> On Fri, Aug 4, 2017 at 8:56 AM, Chad Rosier <mcrosier at codeaurora.org 
> <mailto:mcrosier at codeaurora.org>> wrote:
>
>     All,
>
>     I'm working on an improvement to the inline cost model, but I'm
>     unsure how to proceed.  Let me begin by first describing the
>     problem I'm trying to solve. Consider the following pseudo C code:
>
>     *typedef struct element {
>       unsigned idx;
>     } element_t;
>     *
>
>     *static inline
>     unsigned char fn2 (element_t *dst_ptr, const element_t *a_ptr,
>                        const element_t *b_ptr, unsigned char changed) {
>       if (a_ptr && b_ptr && a_ptr->idx == b_ptr->idx) {
>         if (!changed && dst_ptr && dst_ptr->idx == a_ptr->idx) {
>           /* Do something */
>         } else {
>           changed = 1;
>           if (!dst_ptr)
>             dst_ptr = fn3();
>           else
>             dst_ptr->idx = a_ptr->idx;
>           /* Do something. */
>         }
>       } else {
>         changed = fn4();
>       }
>       return changed;
>     }
>
>     unsigned char fn1 (element_t *a_ptr, element_t *b_ptr) {
>       unsigned char changed = 0;
>       while (b_ptr) {
>         if (!a_ptr || a_ptr->idx == b_ptr->idx) {
>           changed = fn2 (a_ptr, a_ptr, b_ptr, changed);
>           b_ptr = b_ptr->next;
>         }
>       }
>       return changed;
>     }*
>
>     When the inline cost model computes the inline cost of fn2 it ends
>     up being much higher than the inline threshold.  A fair amount of
>     the cost is due to the inlining of fn3 into fn2.  However, if fn2
>     had been inlined into fn1 the code from fn3 would have been
>     removed as dead/unreachable.
>
>     At the fn2 call site notice the first two arguments are equal. 
>     Thus, in the context of fn2 dst_ptr and a_ptr are equal.  The call
>     site of fn3 is predicated on dst_ptr being null (i.e., if
>     (!dst_ptr) dst_ptr = fn3()), but that code is predicated on a_ptr
>     being non-null.  Therefore, we know the condition !dst_ptr is
>     false (because a_ptr == dst_ptr and a_ptr is non-null) and the
>     call to fn3 is dead.  I suspect one of JumpThreading, EarlyCSE, or
>     GVN does the elimination after inlining, so that's what I'd like
>     to try and model in the inline cost model.  (Note fn2 has multiple
>     call sides and the property that the first and second arguments
>     are equal isn't true for each call site, so something like IPSCCP
>     doesn't actually help, AFAICT).
>
>     My first attempt at solving this problem did something similar to
>     what is done in JumpThreadingPass::ProcessImpliedCondition().
>     Specifically, it tried to prove that dst_ptr was non-null based on
>     a dominating condition.  The only tricky parts were to deal with
>     hammocks/diamonds when walking up the CFG (See:
>     https://reviews.llvm.org/D36287 <https://reviews.llvm.org/D36287>
>     as a concrete example of how I proposed to get an immediate
>     dominator without the domtree) and to account for the fact that
>     dst_ptr and a_ptr are equal.
>
>     I'm pretty sure I can get this approach to work, however, I'm not
>     convinced it's really extensible or general.  Should we consider
>     using the full dominator tree in the inline cost model to capture
>     this?
>
>     If you have any thoughts on how to tackle the problem, I would
>     love to hear your feedback!
>
>      Chad
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170804/1aff0757/attachment.html>