[PATCH] D16836: [CodeGenPrepare] Don't transform select instructions into branches when both of operands are cheap

Sat Feb 13 21:54:12 PST 2016

flyingforyou added a comment.

Thanks, Sanjay, Hal, Zia.

I think I couldn't explain what I really wanted to do. What I wanted to do is do not make branch(apart blocks) when test case is seen.

Zia,

> "if the true and false side of the branch are cheap, then the compare doesn't matter."  I think this is fundamentally wrong.

I agree with your mention regarding CMP heuristics which want to hide load's execution cycles.

> Even if both sides of the branch are NOPs, if we have high latency (yes, it's debatable whether a load falls into this category) instructions feeding into the comparison, the new conditional move has to sit and wait until the compare is resolved. With branches, the latency is completely hidden by the prediction (assuming it's correct). I'm not sure about ARM, but this is true for X86, at least.

I think recent OoO core's pipeline is deep enought to execute another uops which have no dependency with conditional move.

And I think Sanjay's second sugesstion is only about singking expensive instructions. I also think this is good idea, but the main problem is that I don't want to make branches.

Hal,

> And furthermore, since we don't (yet) have global instruction selection, breaking apart blocks like this early is almost always a bad idea (because it hampers instruction selection and scheduling, both are which are block local).

I also agree with your opinion. With this patch, we don't break basic block(making branch) for hiding select's stall when select's two operand's are cheap enough. 
Even if it is not logically perfect, this patch shows no regressions in test-suite on core-i5 sandy bridge.
F1508211: test-suite-result-x86-corei5-sandy.csv <http://reviews.llvm.org/F1508211>
I saw a little improvement on lua.

Sanjay,

> In any case, I don't think we should be reasoning about target-specific optimizations at this level (IR) without better target hooks to restrict the transform. And in this particular case, I think it is impossible for the compiler to know whether the load is expensive or not. This is why I suggest pushing that transform down to the DAG or machine-level where it could be limited more easily. Maybe this is something that only OoO x86 would like to do?

  -> in AArch64
  // Prefer likely predicted branches to selects on out-of-order cores.
  if (Subtarget->isCortexA57())
    PredictableSelectIsExpensive = true;

  -> X86
  // A predictable cmov does not hurt on an in-order CPU.
  // FIXME: Use a CPU attribute to trigger this, not a CPU model.
  PredictableSelectIsExpensive = !Subtarget.isAtom();

As you said, most of OoO X86 CPU turn on the flag.

When I ran "llc -mcpu=corei7 -mtriple=x86_64-linux  <  test/CodeGen/AArch64/arm64-select.ll", I can get result likes below. 
Without this patch.

  .Ltmp3:
          .cfi_offset %rbx, -24
  .Ltmp4:
          .cfi_offset %r14, -16
          movq    %rsi, %r14
          movq    %rdi, %rbx
          callq   _Z6getvalv
          movss   (%r14), %xmm1           # xmm1 = mem[0],zero,zero,zero
          ucomiss %xmm0, %xmm1
          jbe     .LBB0_2
  # BB#1:
          addq    $4, %rbx
          jmp     .LBB0_3
  .LBB0_2:                                # %select.false
          addq    $8, %rbx
  .LBB0_3:                                # %select.end
          movl    (%rbx), %eax
          addq    $8, %rsp
          popq    %rbx
          popq    %r14
          retq
  .Lfunc_end0:
          .size   test, .Lfunc_end0-test
          .cfi_endproc

With this patch,

  .Ltmp3:
          .cfi_offset %rbx, -24
  .Ltmp4:
          .cfi_offset %r14, -16
          movq    %rsi, %r14
          movq    %rdi, %rbx
          callq   _Z6getvalv
          movss   (%r14), %xmm1           # xmm1 = mem[0],zero,zero,zero
          leaq    4(%rbx), %rax
          addq    $8, %rbx
          ucomiss %xmm0, %xmm1
          cmovaq  %rax, %rbx
          movl    (%rbx), %eax
          addq    $8, %rsp
          popq    %rbx
          popq    %r14
          retq
  .Lfunc_end0:
          .size   test, .Lfunc_end0-test
          .cfi_endproc

I am not sure that making branch is better than using cmov.

> I would be very interested to know the perf results of completely removing the heuristic-based load transform. I'm not sure what LLVM policy is for this type of situation.

Thanks for sugesstion this, I will run several benchmarks and share the data soon.

Anyway, If we can't modify CMP heuristic, How can we modify EarlyIfConversion's likes below?

  bool EarlyIfConverter::shouldConvertIf() {
  ....
    // Look at all the tail phis, and compute the critical path extension caused
    // by inserting select instructions.
    MachineTraceMetrics::Trace TailTrace = MinInstr->getTrace(IfConv.Tail);
    for (unsigned i = 0, e = IfConv.PHIs.size(); i != e; ++i) {
      SSAIfConv::PHIInfo &PI = IfConv.PHIs[i];
      unsigned Slack = TailTrace.getInstrSlack(PI.PHI);
      unsigned MaxDepth = Slack + TailTrace.getInstrCycles(PI.PHI).Depth;
      DEBUG(dbgs() << "Slack " << Slack << ":\t" << *PI.PHI);

      // The condition is pulled into the critical path.
      unsigned CondDepth = adjCycles(BranchDepth, PI.CondCycles);
      if (CondDepth > MaxDepth) {
        unsigned Extra = CondDepth - MaxDepth;
        DEBUG(dbgs() << "Condition adds " << Extra << " cycles.\n");
        if (Extra > CritLimit) {
          DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
          return false;
        }
      }

      // The TBB value is pulled into the critical path.
      unsigned TDepth = adjCycles(TBBTrace.getPHIDepth(PI.PHI), PI.TCycles);
      if (TDepth > MaxDepth) {
        unsigned Extra = TDepth - MaxDepth;
        DEBUG(dbgs() << "TBB data adds " << Extra << " cycles.\n");
        if (Extra > CritLimit) {
          DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
          return false;
        }
      }

      // The FBB value is pulled into the critical path.
      unsigned FDepth = adjCycles(FBBTrace.getPHIDepth(PI.PHI), PI.FCycles);
      if (FDepth > MaxDepth) {
        unsigned Extra = FDepth - MaxDepth;
        DEBUG(dbgs() << "FBB data adds " << Extra << " cycles.\n");
        if (Extra > CritLimit) {
          DEBUG(dbgs() << "Exceeds limit of " << CritLimit << '\n');
          return false;
        }
      }
    }
    return true;
  }

This is opposite pass of "select to branch".

-Junmo.

http://reviews.llvm.org/D16836