[PATCH][CodeGenPrepare] Split branch conditions into multiple branches.

Mon Dec 8 18:45:51 PST 2014

Hi Jim,

I updated the patch with your comments. I originally tried to enable it also for SelectionDAG, but there were some regressions in the unit tests for PPC and X86 which I have to look into first.

Here is the updated patch. I modified and included also the pattern match code, but I will break that out before I commit it.

Cheers,
Juergen

> On Dec 8, 2014, at 11:06 AM, Jim Grosbach <grosbach at apple.com> wrote:
> 
> Hi Juergen,
> 
> This looks great. A few comments and questions inline.
> 
>> From f8a3aecf2439256286bbe585d03161ac8750f7b1 Mon Sep 17 00:00:00 2001
>> From: Juergen Ributzka <juergen at apple.com <mailto:juergen at apple.com>>
>> Date: Thu, 4 Dec 2014 14:31:43 -0800
>> Subject: [PATCH 2/2] [CodeGenPrepare] Split branch conditions into multiple
>>  conditional branches.
>> 
>> This optimization transforms code like:
>> bb1:
>>   %0 = icmp ne i32 %a, 0
>>   %1 = icmp ne i32 %b, 0
>>   %or.cond = or i1 %0, %1
>>   br i1 %or.cond, label %TrueBB, label %FalseBB
>> 
>> into a multiple branch instructions like:
>> 
>> bb1:
>>   %0 = icmp ne i32 %a, 0
>>   br i1 %0, label %TrueBB, label %bb2
>> bb2:
>>   %1 = icmp ne i32 %b, 0
>>   br i1 %1, label %TrueBB, label %FalseBB
>> 
>> This optimization is already performed by SelectionDAG, but not by
>> FastISel. FastISel cannot perform this optimization, because it cannot
>> generate new MachineBasicBlocks.
>> 
> 
> Should we allow CGP to do this for SDag as well? It seems strange to do the same thing in two places.
> 
> 
>> This fixes rdar://problem/19034919 <rdar://problem/19034919>.
>> ---
>>  lib/CodeGen/CodeGenPrepare.cpp                     | 151 ++++++++++++++++++++-
>>  .../CodeGen/AArch64/fast-isel-branch-cond-split.ll |  42 ++++++
>>  2 files changed, 192 insertions(+), 1 deletion(-)
>>  create mode 100644 test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
>> 
>> diff --git a/lib/CodeGen/CodeGenPrepare.cpp b/lib/CodeGen/CodeGenPrepare.cpp
>> index 8d20848..eaee445 100644
>> --- a/lib/CodeGen/CodeGenPrepare.cpp
>> +++ b/lib/CodeGen/CodeGenPrepare.cpp
>> @@ -30,6 +30,7 @@
>>  #include "llvm/IR/InlineAsm.h"
>>  #include "llvm/IR/Instructions.h"
>>  #include "llvm/IR/IntrinsicInst.h"
>> +#include "llvm/IR/MDBuilder.h"
>>  #include "llvm/IR/PatternMatch.h"
>>  #include "llvm/IR/ValueHandle.h"
>>  #include "llvm/IR/ValueMap.h"
>> @@ -165,6 +166,7 @@ typedef DenseMap<Instruction *, TypeIsSExt> InstrToOrigTy;
>>      bool DupRetToEnableTailCallOpts(BasicBlock *BB);
>>      bool PlaceDbgValues(Function &F);
>>      bool sinkAndCmp(Function &F);
>> +    bool splitBranchCondition(Function &F);
>>    };
>>  }
>>  
>> @@ -218,8 +220,10 @@ bool CodeGenPrepare::runOnFunction(Function &F) {
>>    // into a single target instruction, push the mask and compare into branch
>>    // users. Do this before OptimizeBlock -> OptimizeInst ->
>>    // OptimizeCmpExpression, which perturbs the pattern being searched for.
>> -  if (!DisableBranchOpts)
>> +  if (!DisableBranchOpts) {
>>      EverMadeChange |= sinkAndCmp(F);
>> +    EverMadeChange |= splitBranchCondition(F);
>> +  }
>>  
>>    bool MadeChange = true;
>>    while (MadeChange) {
>> @@ -3793,3 +3797,148 @@ bool CodeGenPrepare::sinkAndCmp(Function &F) {
>>    }
>>    return MadeChange;
>>  }
>> +
>> +/// \brief Scale down both weights to fit into uint32_t.
>> +static void scaleWeights(uint64_t &NewTrue, uint64_t &NewFalse) {
>> +  uint64_t NewMax = (NewTrue > NewFalse) ? NewTrue : NewFalse;
>> +  uint32_t Scale = (NewMax / UINT32_MAX) + 1;
>> +  NewTrue = NewTrue / Scale;
>> +  NewFalse = NewFalse / Scale;
>> +}
>> +
>> +/// \brief Some target prefer to split a conditional branch like:
> 
> s/target/targets/
> 
> Probably worth indicating the criteria by which those targets are identified. In the code it’s just whether branches are considered expensive, so we can just mention that explicitly here, too.
> 
>> +/// \code
>> +///   %0 = icmp ne i32 %a, 0
>> +///   %1 = icmp ne i32 %b, 0
>> +///   %or.cond = or i1 %0, %1
>> +///   br i1 %or.cond, label %TrueBB, label %FalseBB
>> +/// \endcode
>> +/// into a multiple branch instructions like:
>> +/// \code
>> +///   bb1:
>> +///     %0 = icmp ne i32 %a, 0
>> +///     br i1 %0, label %TrueBB, label %bb2
>> +///   bb2:
>> +///     %1 = icmp ne i32 %b, 0
>> +///     br i1 %1, label %TrueBB, label %FalseBB
>> +/// \endcode
>> +/// This usually allows instruction selection to do even further optimizations
>> +/// and combine the compare with the branch instruction.
>> +bool CodeGenPrepare::splitBranchCondition(Function &F) {
>> +  if (!TM || TM->Options.EnableFastISel != true ||
>> +      !TLI || TLI->isJumpExpensive())
>> +    return false;
>> +
>> +  bool MadeChange = false;
>> +  for (auto BB = F.begin(), E = F.end(); BB != E; ++BB) {
> 
> Why not a range-based for loop?
> 
>> +    // Does this BB end with the following?
>> +    //   %cond1 = icmp|fcmp|binary instruction ...
>> +    //   %cond2 = icmp|fcmp|binary instruction ...
>> +    //   %cond.or = or|and i1 %cond1, cond2
>> +    //   br i1 %cond.or label %dest1, label %dest2"
>> +    auto *Br1 = dyn_cast<BranchInst>(BB->getTerminator());
>> +    if (!Br1)
>> +      continue;
>> +
>> +    if (Br1->isUnconditional())
>> +      continue;
>> +
>> +    auto *LogicOp = dyn_cast<BinaryOperator>(Br1->getCondition());
>> +    if (!LogicOp)
>> +      continue;
>> +
>> +    unsigned Idx;
>> +    if (LogicOp->hasOneUse() && (LogicOp->getOpcode() == Instruction::And))
>> +      Idx = 0;
>> +    else if (LogicOp->hasOneUse() && (LogicOp->getOpcode() == Instruction::Or))
>> +      Idx = 1;
>> +    else
>> +      continue;
>> +
>> +    auto *Cond1 = dyn_cast<Instruction>(LogicOp->getOperand(0));
>> +    auto *Cond2 = dyn_cast<Instruction>(LogicOp->getOperand(1));
>> +
>> +    if (!Cond1 || (!isa<CmpInst>(Cond1) && !isa<BinaryOperator>(Cond1)) ||
>> +        !Cond2 || (!isa<CmpInst>(Cond2) && !isa<BinaryOperator>(Cond2))   )
>> +      continue;
>> +
>> +    if (!Cond1->hasOneUse() || !Cond2->hasOneUse())
>> +      continue;
> 
> This matching sequence might be simpler to express with PatternMatch.h.
> 
>> +
>> +    BasicBlock *TrueDest = Br1->getSuccessor(0);
>> +    BasicBlock *FalseDest = Br1->getSuccessor(1);
>> +
>> +    DEBUG(dbgs() << "Before branch condition splitting\n"; BB->dump());
>> +
>> +    // Create a new BB.
>> +    auto *InsertBefore = std::next(Function::iterator(BB))
>> +        .getNodePtrUnchecked();
>> +    auto NewBB = BasicBlock::Create(BB->getContext(),
>> +                                    BB->getName() + ".cond.split",
>> +                                    BB->getParent(), InsertBefore);
>> +
>> +    // Update original basic block by using the first condition directly by the
>> +    // branch instruction and removing the no longer needed and/or instruction.
>> +    Br1->setCondition(Cond1);
>> +    LogicOp->eraseFromParent();
>> +    Cond2->removeFromParent();
>> +    Br1->setSuccessor(Idx, NewBB);
>> +
>> +    // Fill in the new basic block.
>> +    auto *Br2 = BranchInst::Create(TrueDest, FalseDest, Cond2, NewBB);
>> +    Cond2->insertBefore(Br2);
> 
> Using IRBuilder rather than explicit BranchInstr::Create may be cleaner/clearer.
> 
>> +
>> +    // Update PHI nodes in both successors.
>> +    if (Idx == 1)
>> +      std::swap(TrueDest, FalseDest);
>> +
> 
> Creating a new conditional branch instruction for TrueDest and FalseDest then immediately swapping the values for those pointers is odd. Probably correct based on the rest of the code, but a more detailed comment about what’s going on would be handy.
> 
> 
>> +    // Replace the old BB with the new BB.
>> +    for (auto I = TrueDest->begin(), IE = TrueDest->end(); I != IE; ++I) {
> 
> Range based loop?
> 
>> +      PHINode *PN = dyn_cast<PHINode>(I);
>> +      if (!PN)
>> +        break;
>> +      int i;
>> +      while ((i = PN->getBasicBlockIndex(BB)) >= 0)
>> +        PN->setIncomingBlock(i, NewBB);
>> +    }
>> +
>> +    // Add another incoming edge form the new BB.
>> +    for (auto I = FalseDest->begin(), IE = FalseDest->end(); I != IE; ++I) {
> 
> Range based loop?
> 
>> +      PHINode *PN = dyn_cast<PHINode>(I);
>> +      if (!PN)
>> +        break;
>> +      auto *Val = PN->getIncomingValueForBlock(BB);
>> +      PN->addIncoming(Val, NewBB);
>> +    }
>> +
>> +    // Update the branch weights (based on SelectionDAGBuilder::
>> +    // FindMergedConditions).
>> +    uint64_t TrueWeight, FalseWeight;
>> +    if (Br1->getBranchWeights(TrueWeight, FalseWeight)) {
>> +      uint64_t NewTrueWeight = TrueWeight;
>> +      uint64_t NewFalseWeight = TrueWeight + 2 * FalseWeight;
>> +
> 
> These numbers could use a bit of explanation.
> 
>>      scaleWeights(NewTrueWeight, NewFalseWeight);
>> +      Br1->setMetadata(LLVMContext::MD_prof, MDBuilder(Br1->getContext())
>> +          .createBranchWeights(TrueWeight, FalseWeight));
>> +
>> +      NewTrueWeight = TrueWeight;
>> +      NewFalseWeight = 2 * FalseWeight;
>> +      scaleWeights(NewTrueWeight, NewFalseWeight);
>> +      Br2->setMetadata(LLVMContext::MD_prof, MDBuilder(Br2->getContext())
>> +          .createBranchWeights(TrueWeight, FalseWeight));
>> +    }
>> +
>> +    // Request DOM Tree update.
>> +    // Note: No point in getting fancy here, since the DT info is never
>> +    // available to CodeGenPrepare and the existing update code is broken
>> +    // anyways.
>> +    ModifiedDT = true;
>> +
>> +    MadeChange = true;
>> +
>> +    DEBUG(dbgs() << "After branch condition splitting\n"; BB->dump();
>> +          NewBB->dump());
>> +  }
>> +
>> +  return MadeChange;
>> +}
>> diff --git a/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll b/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
>> new file mode 100644
>> index 0000000..058007b
>> --- /dev/null
>> +++ b/test/CodeGen/AArch64/fast-isel-branch-cond-split.ll
>> @@ -0,0 +1,42 @@
>> +; RUN: llc -mtriple=aarch64-apple-darwin -fast-isel -fast-isel-abort -verify-machineinstrs < %s | FileCheck %s
>> +
>> +; CHECK-label: test_or
>> +; CHECK:       cbnz w0, {{LBB[0-9]+_2}}
>> +; CHECK:       cbz w1, {{LBB[0-9]+_1}}
>> +define i64 @test_or(i32 %a, i32 %b) {
>> +bb1:
>> +  %0 = icmp eq i32 %a, 0
>> +  %1 = icmp eq i32 %b, 0
>> +  %or.cond = or i1 %0, %1
>> +  br i1 %or.cond, label %bb3, label %bb4, !prof !0
>> +
>> +bb3:
>> +  ret i64 0
>> +
>> +bb4:
>> +  %2 = call i64 @bar()
>> +  ret i64 %2
>> +}
>> +
>> +; CHECK-label: test_ans
>> +; CHECK:       cbz w0, {{LBB[0-9]+_2}}
>> +; CHECK:       cbnz w1, {{LBB[0-9]+_3}}
>> +define i64 @test_and(i32 %a, i32 %b) {
>> +bb1:
>> +  %0 = icmp ne i32 %a, 0
>> +  %1 = icmp ne i32 %b, 0
>> +  %or.cond = and i1 %0, %1
>> +  br i1 %or.cond, label %bb4, label %bb3, !prof !1
>> +
>> +bb3:
>> +  ret i64 0
>> +
>> +bb4:
>> +  %2 = call i64 @bar()
>> +  ret i64 %2
>> +}
>> +
>> +declare i64 @bar()
>> +
>> +!0 = metadata !{metadata !"branch_weights", i32 5128, i32 32}
>> +!1 = metadata !{metadata !"branch_weights", i32 1024, i32 4136}
>> -- 
>> 2.1.3
>> 
> 
> 
> 
> 
>> On Dec 4, 2014, at 3:59 PM, Juergen Ributzka <juergen at apple.com <mailto:juergen at apple.com>> wrote:
>> 
>> Updated patch. This time with a test case and the optimization only kicks in when FastISel is enabled.
>> 
>> <0002-CodeGenPrepare-Split-branch-conditions-into-multiple-v2.patch>
>>> On Dec 4, 2014, at 2:57 PM, Juergen Ributzka <juergen at apple.com <mailto:juergen at apple.com>> wrote:
>>> 
>>> Hi @ll,
>>> 
>>> SimplifyCFG likes to fold conditional branches to a common destination into a single conditional branch, where the original conditions can be combined with an and/or instruction (FoldBranchToCommonDest).
>>> 
>>> Example:
>>> bb1:
>>> %0 = icmp eq i32 %a, 0
>>> br i1 %0, label %bb3, label %bb2, !prof !0
>>> 
>>> bb2:
>>> %1 = icmp eq i32 %b, 0
>>> br i1 %1, label %bb3, label %bb4, !prof !1
>>> 
>>> is optimized to:
>>> 
>>> bb1:
>>> %0 = icmp eq i32 %a, 0
>>> %1 = icmp eq i32 %b, 0
>>> %or.cond = or i1 %0, %1
>>> br i1 %or.cond, label %bb3, label %bb4, !prof !0
>>> 
>>> 
>>> During SelectionDAG time this transformation is reversed, because it is usually more efficient (code size and performance) to express this code as multiple branches. SelectionDAG can do this, because it can generate new MachineBasicBlocks. FastISel on the other side cannot do this and generates inefficient code (I see 10% regression on some benchmarks).
>>> 
>>> The attached patch fixes this by performing the splitting early during CodeGenPrepare. In theory this code can be shared now by SelectionDAG and FastISel and the original implementation in SelectionDAG could be removed.
>>> 
>>> 
>>> Any comments and/or ideas about this approach in general? Is there a better way to do this?
>>> 
>>> Thanks
>>> 
>>> Cheers,
>>> Juergen
>>> 
>>> <0001-Move-function-to-obtain-branch-weights-into-the-Bran.patch><0002-CodeGenPrepare-Split-branch-conditions-into-multiple.patch>
>>> 
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141208/3c7bafef/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-CodeGenPrepare-Split-branch-conditions-into-multiple.patch
Type: application/octet-stream
Size: 15285 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141208/3c7bafef/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141208/3c7bafef/attachment-0001.html>