[PATCH] D17288: [CodeGenPrepare] Do select to branch transform when cmp's operand is expensive.

Tue Apr 5 00:32:25 PDT 2016

Gerolf added inline comments.

================
Comment at: lib/CodeGen/CodeGenPrepare.cpp:4525
@@ +4524,3 @@
+    if (I && I->getOpcode() == Instruction::FDiv &&
+        STI->getSchedModel().FdivLatency >
+            STI->getSchedModel().MispredictPenalty)
----------------
flyingforyou wrote:
> flyingforyou wrote:
> > Gerolf wrote:
> > > Is that really a good heuristic? Even when the divide latency is less than or equal to the branch misprediction penalty issuing a branch can be the better choice. That depends on the program behavior. I believe the reasoning you are looking for is this: in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. Practically the long latency of the divide covers for the (dynamic) instances when that assumption is wrong. 
> > > Even when the divide latency is less than or equal to the branch misprediction penalty issuing a branch can be the better choice. That depends on the program behavior. 
> > I agree with this idea, but what can we do about it in this patch?
> > 
> > 
> > > Is that really a good heuristic? 
> > If you think this is not good, what heuristic do you recommend?
> > 
> > 
> > I believe the reasoning you are looking for is this: in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. Practically the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.
> 
> My point is this. When we removed the load-cmp-csel heuristic, the main argument concerned the load's execution time. That heuristic assumed a load can take a huge number of cycles on a cache miss. But recent microarchitectures have large caches, especially those that support OoO execution, so in most cases we don't need to worry about cache misses.
> 
> div-cmp-csel is almost the same idea as the cache-miss case above. Most microarchitectures execute floating-point division with high latency, so applying this heuristic can give a large benefit by hiding the division's execution cycles.
> 
> 
> 
> > in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. 
> About this, I think branch prediction is good even when the instruction's latency is small. But if prediction fails on a short-latency sequence such as "add-cmp-branch", we can easily see that the transformation was wrong. So we only try the "div-cmp-branch" case.
> 
> 
> 
When the branch is well predicted I don't see a reason to generate a csel (except for code size). The crux is that the compiler has to model two unknowns: is there a hot path, and is there a branch misprediction penalty? Profiling helps, but is not always (or perhaps better, rarely) available. I think a reasonable heuristic, akin to what you are pursuing, is this: conceptually a csel merges two paths. When the paths are unbalanced, don't generate a csel. The paths are unbalanced when their execution times differ "a lot". For example, if one path consumes a long latency operation but the other does not, consider the paths unbalanced and don't issue a csel. Or, if you know that on your uArch branches are rarely mispredicted across a wide range of apps, a csel should only be generated when there is a very specific reason for it.
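
To make the "unbalanced paths" idea concrete, here is a rough standalone sketch. It is not part of the patch and does not use any LLVM API: PathCost, UnbalancedThresholdCycles, pathsAreUnbalanced and shouldKeepSelect are hypothetical names, and the 10-cycle cutoff is an assumed placeholder for whatever "a lot" should mean on a given target.

```cpp
// Hypothetical sketch; none of these names exist in CodeGenPrepare.
struct PathCost {
  unsigned LatencyCycles; // estimated latency of the work feeding this path
};

// Assumed cutoff for "differ a lot"; a real value would be target specific.
constexpr unsigned UnbalancedThresholdCycles = 10;

// Two paths are unbalanced when their estimated latencies differ by at
// least the threshold.
inline bool pathsAreUnbalanced(PathCost A, PathCost B,
                               unsigned Threshold = UnbalancedThresholdCycles) {
  unsigned Hi = A.LatencyCycles > B.LatencyCycles ? A.LatencyCycles
                                                  : B.LatencyCycles;
  unsigned Lo = A.LatencyCycles > B.LatencyCycles ? B.LatencyCycles
                                                  : A.LatencyCycles;
  return Hi - Lo >= Threshold;
}

// Returns true when the csel should be kept, false when a branch is preferred.
inline bool shouldKeepSelect(PathCost TruePath, PathCost FalsePath,
                             bool BranchesRarelyMispredicted,
                             bool HasSpecificReasonForSelect) {
  // On a uArch where branches are rarely mispredicted, keep the csel only
  // when there is a specific reason for it.
  if (BranchesRarelyMispredicted)
    return HasSpecificReasonForSelect;
  // Otherwise keep the csel only when the two paths are roughly balanced.
  return !pathsAreUnbalanced(TruePath, FalsePath);
}
```

With these assumed numbers, a path that consumes a 20-cycle divide against a path that consumes a 1-cycle add counts as unbalanced, so the sketch prefers a branch.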

================
Comment at: lib/CodeGen/CodeGenPrepare.cpp:4532
@@ +4531,3 @@
+
+  if (IsExpensiveCostInst(CmpOp0) || IsExpensiveCostInst(CmpOp1))
+    return true;
----------------
flyingforyou wrote:
> Gerolf wrote:
> > In the case where both paths consume the long latency result, select is still the better choice.
> Why do you think so?
Both paths require the result of the long latency instruction. So at least it is less likely that your optimization helps.
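
As a rough illustration of the check I have in mind (a standalone sketch with hypothetical names, not LLVM API and not code from the patch): SelectOperands and branchLikelyHidesLatency are made up here, and the three flags are assumed to have been computed elsewhere.

```cpp
// Hypothetical summary of how a select relates to one expensive instruction.
struct SelectOperands {
  bool CmpUsesExpensiveResult;      // the condition depends on the expensive op
  bool TrueArmUsesExpensiveResult;  // the true value depends on it
  bool FalseArmUsesExpensiveResult; // the false value depends on it
};

// Returns true when turning the select into a branch is likely to hide the
// expensive instruction's latency.
inline bool branchLikelyHidesLatency(const SelectOperands &S) {
  // The transform under review only applies when the cmp operand is expensive.
  if (!S.CmpUsesExpensiveResult)
    return false;
  // If both arms need the expensive result, either side of the branch still
  // waits for it, so the branch is unlikely to help.
  return !(S.TrueArmUsesExpensiveResult && S.FalseArmUsesExpensiveResult);
}
```

This only encodes the "both paths need the result" case; it says nothing about how well the branch is predicted.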

http://reviews.llvm.org/D17288