[PATCH] D17288: [CodeGenPrepare] Do select to branch transform when cmp's operand is expensive.

Sun Apr 3 23:10:17 PDT 2016

flyingforyou added inline comments.

================
Comment at: lib/CodeGen/CodeGenPrepare.cpp:4525
@@ +4524,3 @@
+    if (I && I->getOpcode() == Instruction::FDiv &&
+        STI->getSchedModel().FdivLatency >
+            STI->getSchedModel().MispredictPenalty)
----------------
flyingforyou wrote:
> Gerolf wrote:
> > It that really a good heuristic? Even when the divide latency is less than or equal to the branch mispredication penalty issuing a branch can be the better choice. That depends on the program behavior. I believe the reasoning you are looking for is this: in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. Practically the long latency of the divide covers for the (dynamic) instances when that assumption is wrong. 
> > Even when the divide latency is less than or equal to the branch mispredication penalty issuing a branch can be the better choice. That depends on the program behavior. 
> I also agree with this idea.. But what we can do for this in this patch?
> 
> 
> > It that really a good heuristic? 
> If you think this is not good, what heuristic do you recommend?
> 
> 
> I believe the reasoning you are looking for is this: in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. Practically the long latency of the divide covers for the (dynamic) instances when that assumption is wrong.

My point is this. When we remove the load-cmp-csel heuristic, there is a main point which is related with load's execution cycle. The heuristic assumes that load can be taken huge cycles during cache-miss. But recent uArchitecture has big cache especially if it supports OoO execution. So we don't need to worry about cache-miss most of cases.

div-cmp-csel is almost same idea likes above with cache-miss case. Most of uArchitecture executes floating point division with high latency. So, if we apply this heuristic, we can get huge benefit due to hiding division's execution cycles.

> in the presence of a long latency instruction assume the dependent branch is well predicted most of the time. 
About this, I think branch prediction is good, even if instruction's execution cycle is small. But if the prediction is failed when executing short latency instructions something likes "add-cmp-branch", we can easily recognize the tranformation is wrong. So we just try "div-cmp-branch" case.

http://reviews.llvm.org/D17288