[PATCH] D13417: [MachineCombiner] make slack optional in critical path cost calculation (PR25016)

Wed Nov 4 16:06:57 PST 2015

> On Nov 4, 2015, at 9:58 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> 
> spatel added a comment.
> 
> In http://reviews.llvm.org/D13417#280784, @Gerolf wrote:
> 
>> I think we need your patch. It would be nice to have also a test case that shows an actual increase 
>> in critical path length and a measurable performance difference. In the test cases so far the vadd 
>> sequences take longer w/o actually increasing the critical path.
> 
> 
> Hmm...I'm not seeing a way to make that happen. Any suggestions on how to expose this?
I know. Right now we have reassoc on x86 and madd combine on ARM … We’ll get a test case eventually as we push more functionality into the combiner (and perhaps find a better name for what the machine combiner is actually doing). 
> 
> We should never increase the calculated critical path (which includes slack). We know from the example that uses 'call' instructions that our CPU model may be incorrect, and this may lead to a real-world increase in CPL. That is the scenario I'm hoping to avoid because we know that a CPU model in the compiler will never be a 100% accurate simulator for any sufficiently complex CPU.
I ignored the call modeling issue. That is separate. The other part we pushed aside in this thread is reassociation and register pressure. Both are orthogonal to your patch.

> 
>> So just by looking at a reassociation case in isolation I wouldn’t worry about modifying the heuristic. 
> 
>> However, in a scenario eg. with reassoc first and then a mul+add -> madd the bigger depth of the 
> 
>> re-associated instruction could result in not performing the mul+ add combine later. It could be that
> 
>> the extra depth may have eaten the slack the madd needs to fit within the critical path.
> 
> 
> I'm sorry, but I'm still not following this argument. Let's say that we don't even have a reassoc transform here, but the instructions themselves are already reassociated optimally. In that case, if a muladd transform that would be beneficial does not fire, then it must be a deficiency in the muladd logic or MTM, right? Ie, the cost calc for doing that transform should be independent of whether the code was transformed here or earlier by some other mechanism. We guarantee this by invalidating the MTM data between transforms here in MachineCombiner.

Yes, I wish I  could demonstrate this better in an actual test case ( the ’nice to have’ thing above). Perhaps this is more clear (or at least clear enough to possibly reveal a flaw in my thinking): The critical path length CPL is s depth + height + slack. mul+add -> madd as long as newCPL <= oldCPL. My hypothetical scenario assumes the mul is dependent on an instruction that gets reassociated. So it’s depth (and the depth of the add) could increase w/o your patch. That would increase the depth of the hypothetical madd and now that transformation may no longer happen because depth(madd) + latency(madd) could exceed the original CPL. 
> 
>> Also, since reassoc attempts to reduce depth shouldn’t the equation ignore latencies and 
> 
>> the code finally look something like if (MustReduceDepth) return NewDepth < OldDepth else 
> 
>> if (MustReduceCPL) return NewCycleCount < OldCycleCount else return 
> 
>> NewCycleCount <= OldCycleCount?
> 
> 
> Yes, this looks correct. I think this is almost identical to the logic implemented in this patch, but we can eliminate the extra calculation of 'RootLatency' and 'NewRootLatency' because we know they are always identical for reassociation. Let me fix this and update the patch.
The logic, yes.  But USE_SLACK is a vague concept. So I ‘d like to see it expressing explicitly what a new pattern attempts to do.
> 
>> Thanks for effort and being persistent.
> 
> 
> Thank you! I think I have a much better understanding of MTM now. :)

Me too :-)
> 
> 
> http://reviews.llvm.org/D13417
> 
> 
>