[PATCH] D16836: [CodeGenPrepare] Don't transform select instructions into branches when both of operands are cheap

Fri Feb 5 09:14:53 PST 2016

spatel added a comment.

In http://reviews.llvm.org/D16836#344790, @flyingforyou wrote:

> Addressed Hal's comment.

I don't think the patch implements what Hal suggested. You're checking if the direct operands of a "free" instruction are not Arg/Phi/Const. The suggestion was to recurse/loop until we are sure that no operands in the chain are not free.

> Sanjay, Thank you for sharing long history of releated commit.

> 

> I agree with your concern also. But first commit was merged on 2012. It's almost 4 years ago. Recent OoO core has more logic for avoiding cache-miss. (Something likes HW prefetcher..)

Please correct me if I am not understanding: the fact that HW may have gotten *better* at caching is more reason to remove the heuristic-based load "optimization"; it means that a load is more likely to be cheap.

In any case, I don't think we should be reasoning about target-specific optimizations at this level (IR) without better target hooks to restrict the transform. And in this particular case, I think it is impossible for the compiler to know whether the load is expensive or not. This is why I suggest pushing that transform down to the DAG or machine-level where it could be limited more easily. Maybe this is something that only OoO x86 would like to do?

> And I also don't think your previous patch is hack. Even if we can avoid cache-miss penalty, if there are no enough heavy instructions like division, it can't get enough improvement. (Of course, it depends on instruction stream after branch.)

I think we're mixing up 2 independent issues now. The load transform and the TCC_Expensive transform have the same goal: don't execute an expensive instruction unnecessarily. But the first is driven by a heuristic (a single use load is expensive) while the second is driven by the fixed and known TTI cost of an instruction such as division.

> I also get some improvements on commercial benchmark both of X86, AArch64.

I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation. I've cc'd @bkramer in case it is possible to know which app/benchmark was improved by r156234. Clearly, it was driven by an improvement for OoO x86-64.

http://reviews.llvm.org/D16836