[PATCH] ARM64: Combine shifts and uses to bit-extract instruction from different basic block

Wed Apr 16 17:23:47 PDT 2014

Hi Yi,

I hope the format of the review will not be too difficult to digest. For the next iterations you may want to create a phabricator (http://reviews.llvm.org).

I have a couple of concerns that I split into two categories:
- Thing to fix: Are fundamental things that would produce correctness issue or hard to maintain problems.
- Nice to have: Are nitpick on the code.

** Things to Fix **

* Logic *

#1
+          (!TLI.isTypeLegal(TLI.getValueType(User->getType())
This does not seem a sufficient check to guess whether or not a truncate will be issued at legalization.
We should instead check if the operation is legal for this type.
Indeed, i16 may not be a legal type, but for instance store i16 may and no truncate will be issued.

#2
+            TheUse = InsertedShift;     /// <———— This does not look right to me.
+            TruncTheUse = InsertedTrunc;
+          }
Unless I am mistaken, you are transferring the use, in the original truncate, of the original shift into a use of the new shift in the original truncate.
This use must be left untouched.

You may want to check if the original truncate still have uses and nuke it if it does not, though.

 #3
+  assert((Op0->getOpcode() == ISD::TRUNCATE || BiggerPattern ||
+          (Srl_imm > 0 && Srl_imm < VT.getSizeInBits())) &&

Based on the code this patch change, you shouldn’t have to modify this assert. If you do have to modify it, it seems to me that there is something wrong.
Please double check.

#4
   // we're looking for a shift of a shift
   uint64_t Shl_imm = 0;
   if (isOpcWithIntImmediate(N->getOperand(0).getNode(), ISD::SHL, Shl_imm)) {
     Opd0 = N->getOperand(0).getOperand(0);
+  } else if (VT == MVT::i32 && N->getOpcode() == ISD::SRL &&
+             N->getOperand(0).getNode()->getOpcode() == ISD::TRUNCATE) {
+    // In this case, 64bit UBFM and 32bit UBFM have the same semantics. Our
+    // strategy here is to always generate 64bit UBFM. This consistency will
+    // help the CSE pass later find more redundancy.
+    Opd0 = N->getOperand(0).getOperand(0);
+    DstVT = MVT::i64;

There are several problems here.
A - This part of the code is looking for SHL, not SRL. The part looking for the SRL is later in the same function:
  if (!isIntImmediate(N->getOperand(1), Srl_imm))
    return false;
Your logic should be added somewhere here.
B - You did not update the Srl_imm information. Thus the resulting code may be wrong.

#5
+    DstVT = MVT::i64;
You should reuse VT here (assuming you moved the related code at the right place) like you did in the AND case.
More importantly, please use the type from the node (and return false on VT != MVT::i64 or at least assert that MVT::i64), not a hard coded one.

Remove this change:
-  if (VT == MVT::i32)
+  if (DstVT == MVT::i32)

#6
+  // If the bit extract operation is 64bit but the original type is 32bit, we
+  // need to add one EXTRACT_SUBREG.
+  if (Opc == ARM64::SBFMXri || Opc == ARM64::UBFMXri) {
+    if (VT == MVT::i32) {
+      SDNode *BFM = CurDAG->getMachineNode(Opc, SDLoc(N), MVT::i64, Ops);

Technically speaking you should create different Ops when promoting to the i64 type.
Indeed, current Ops were created with:
  SDValue Ops[] = { Opd0, CurDAG->getTargetConstant(LSB, VT),
                    CurDAG->getTargetConstant(MSB, VT) };
Thus, the VTs do not match for the constants.

#7
+  // If the bit extract operation is 64bit but the original type is 32bit, we
+  // need to add one EXTRACT_SUBREG.
+  if (Opc == ARM64::SBFMXri || Opc == ARM64::UBFMXri) {
+    if (VT == MVT::i32) {
Sink the second if into the first one. That will reduce the indentation as a bonus :).

#8
+  if (Opc == ARM64::SBFMXri || Opc == ARM64::UBFMXri) {
+    if (VT == MVT::i32) {
I am not a big fan of the way you detect the fact that the type has been promoted.
I think it would be cleaner to add (yet) another reference argument to know whether or not we have to do that.
Ultimately, it looks like we could benefit from a helper class here, but the refactoring is out of the scope of this patch.

* Tests *

#9
I’d like you add an additional RUN line with opt and check the actual transformation done on the IR.

#10
+; CHECK-LABEL: LBB22_2:
Please do not rely on basic blocks labels. That looks fragile.
+; CHECK-LABEL: Lloh4:
This is even more true for these that are added by the ARM64CollectLOH pass.

* Comments *

#11
+/// OptimizeExtractBits - sink the given shift instruction into user blocks if
+/// the uses could potentially be combined with the shift instruction and
+/// generate BitExtract instruction. It will only be applied if the architecture
+/// supports BitExtract instruction.
+///
+/// Return true if any changes are made.
You should be more specific here. Like specifying the pattern you are recognizing.
Indeed the shifts that are supported in this patch are logical and arithmetic right shift.
In other words, this optimization should help for pattern like this:
Input:
XXXXXXPPPXXX
Output:
000000000PPP
(shift right + and of low bits)

But not for pattern like this:
Input:
XXXXXXPPPXXX
Output:
0000PPP00000
(shift left + and of contiguous bits)

#12
+    if (UserBB == DefBB) {
+      // The type of the use is not legal in this architecture so there is a
+      // implicit truncation.
+      // for example:
+      // BB1:
+      // i64 shift.result = shlr i64 opnd, imm
+      // trunc.result = trunc shift.result to i16
+      //
+      // BB2:
+      //   ----> We will have a implicit truncate here if the architecture does
+      //   not have i16 register.
+      // use i16 trunc.result

This comment is confusing because:
A - You mention two basic blocks after a check that both the defBB and the useBB are equal.
B - The first sentence is missing a ‘if’ I guess.

What about something like (do not copy/paste, this is not a wonderful wording either :)):
If UseBB and DefBB are equal and use is a truncate, we may still want to sink both the shift and the truncate within the block of the use of the truncate.
Indeed, if the use of the truncate is not legal it may produce a truncate at legalization.

** Nice to Have **

* lib/CodeGen/CodeGenPrepare.cpp *

#13
The body of OptimizeExtractBits gets a bit complex, I think some refactoring (essentially helper functions) would help.

+    // The candidates that could be combined with shift instruction.
+    // 1. Truncate instruction
+    // 2. And instruction and the imm is a mask of the low bits: 
+    // imm & (imm+1) == 0
+    if (!isa<TruncInst>(User)) {
+      if (User->getOpcode() != Instruction::And ||
+          !isa<ConstantInt>(User->getOperand(1)))
+        continue;
+
+      unsigned Cimm =
+          dyn_cast<ConstantInt>(User->getOperand(1))->getZExtValue();
+
+      if (Cimm & (Cimm + 1))
+        continue;
+    }
=>
move that into a helper function like 'bool isExtractBitsCandidateUse’.
Then, adapt the loop accordingly.

#14
Same for this:
+      // The type of the use is not legal in this architecture so there is a
+      // implicit truncation.
+      // for example:
+      // BB1:
+      // i64 shift.result = shlr i64 opnd, imm
+      // trunc.result = trunc shift.result to i16
+      //
+      // BB2:
+      //   ----> We will have a implicit truncate here if the architecture does
+      //   not have i16 register.
+      // use i16 trunc.result
+      //
+      if (isa<TruncInst>(User) &&
+          (TLI.isTypeLegal(TLI.getValueType(ShiftI->getType()))) &&
+          (!TLI.isTypeLegal(TLI.getValueType(User->getType())))) {
+        TruncInst *TruncI = dyn_cast<TruncInst>(User);
+
+        for (Value::user_iterator TruncUI = TruncI->user_begin(),
+                                  TruncE = TruncI->user_end();
+             TruncUI != TruncE;) {
+
+          Use &TruncTheUse = TruncUI.getUse();
+          Instruction *TruncUser = cast<Instruction>(*TruncUI);
+          // Preincrement use iterator so we don't invalidate it.
+
+          ++TruncUI;
+
+          // Don't bother for PHI nodes.
+          if (isa<PHINode>(TruncUser))
+            continue;
+
+          BasicBlock *TruncUserBB = TruncUser->getParent();
+
+          if (UserBB == TruncUserBB)
+            continue;
+
+          BinaryOperator *&InsertedShift = InsertedShifts[TruncUserBB];
+          CastInst *&InsertedTrunc = InsertedTruncs[TruncUserBB];
+
+          if (!InsertedShift && !InsertedTrunc) {
+            BasicBlock::iterator InsertPt = TruncUserBB->getFirstInsertionPt();
+            // Sink the shift
+            if (ShiftI->getOpcode() == Instruction::AShr)
+              InsertedShift = BinaryOperator::CreateAShr(ShiftI->getOperand(0),
+                                                         CI, "", InsertPt);
+            else
+              InsertedShift = BinaryOperator::CreateLShr(ShiftI->getOperand(0),
+                                                         CI, "", InsertPt);
+
+            // Sink the trunc
+            BasicBlock::iterator TruncInsertPt =
+                TruncUserBB->getFirstInsertionPt();
+            TruncInsertPt++;
+
+            InsertedTrunc =
+                CastInst::Create(TruncI->getOpcode(), InsertedShift,
+                                 TruncI->getType(), "", TruncInsertPt);
+
+            MadeChange = true;
+
+            TheUse = InsertedShift;
+            TruncTheUse = InsertedTrunc;
+          }
+        }
+      }

You have a big chunk of code to handle a special case. Move that into a helper function.

* lib/Target/ARM64/ARM64ISelDAGToDAG.cpp *

#15
+    // If the shift result was truncated, we can still combine them.
+    Opd0 = Op0->getOperand(0).getNode()->getOperand(0);
You could simply use
Op0->getOperand(0).getOperand(0)

Thanks,
-Quentin

On Apr 16, 2014, at 9:36 AM, yijiang <yjiang at apple.com> wrote:

> Here is the new patch. More comments are appreciated!
> <extractbit-v2.patch>
> On Apr 16, 2014, at 9:29 AM, Yi Jiang <yjiang at apple.com> wrote:
> 
>> You are right. Thank you. 
>> On Apr 14, 2014, at 6:59 PM, Matt Arsenault <Matthew.Arsenault at amd.com> wrote:
>> 
>>> On 04/14/2014 05:25 PM, yijiang wrote:
>>>> +  void setHasExtractBitsInsn(bool hasExtractInsn = false) {
>>>> +    HasExtractBitsInsn = hasExtractInsn;
>>>> +  }
>>> The default argument should be true, since that's what all the others do and since the default value is already false
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140416/d1f2dc4b/attachment.html>