[PATCH] D141653: [X86] Improve instruction ordering of constant `srl/shl` with `and` to get better and-masks
Phoebe Wang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Jan 14 02:05:15 PST 2023
pengfei added inline comments.
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:47438
-static SDValue combineShiftLeft(SDNode *N, SelectionDAG &DAG) {
+// Try and re-ordering an `and` and `srl/shl` if it result in a better constant
+// for the `and`. Note this only tries to optimize by re-ordering, other
----------------
results
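For context, the reordering the new comment describes relies on the identity `(X & C1) >> C2 == (X >> C2) & (C1 >> C2)` for logical shifts, so the combine can pick whichever operand order yields the cheaper-to-encode `and` mask. A minimal standalone sketch of that identity, using plain integers and hypothetical names rather than the actual DAG code:

    #include <cassert>
    #include <cstdint>

    // For logical shifts, masking before or after the shift is equivalent
    // as long as the mask is shifted the same way, so the combine is free
    // to pick whichever order gives the cheaper `and` constant.
    static uint64_t maskThenShift(uint64_t X, uint64_t C1, unsigned C2) {
      return (X & C1) >> C2;
    }
    static uint64_t shiftThenMask(uint64_t X, uint64_t C1, unsigned C2) {
      return (X >> C2) & (C1 >> C2);
    }

    int main() {
      for (uint64_t X : {0x0123456789abcdefULL, ~0ULL, 0ULL})
        assert(maskThenShift(X, 0xff00, 8) == shiftThenMask(X, 0xff00, 8));
    }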
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:47438
-static SDValue combineShiftLeft(SDNode *N, SelectionDAG &DAG) {
+// Try and re-ordering an `and` and `srl/shl` if it result in a better constant
+// for the `and`. Note this only tries to optimize by re-ordering, other
----------------
pengfei wrote:
> results
One more space. You may run clang-format to help solve the format nits.
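For example, assuming the change is the most recent local commit, `git clang-format HEAD~1` reformats just the lines the patch touches.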
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:47533
+ } else
+ NewAndMask = AndMask.lshr(ShiftCnt);
+
----------------
Use curly braces for consistency.
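That is, since the `if` side of this chain uses braces, the `else` should be braced too. A sketch of the expected form of the quoted hunk:

    } else {
      NewAndMask = AndMask.lshr(ShiftCnt);
    }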
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:47679
- // Try to improve a sequence of srl (and X, C1), C2 by inverting the order.
- // TODO: This is a generic DAG combine that became an x86-only combine to
- // avoid shortcomings in other folds such as bswap, bit-test ('bt'), and
- // and-not ('andn').
- if (N0.getOpcode() != ISD::AND || !N0.hasOneUse())
- return SDValue();
-
- auto *ShiftC = dyn_cast<ConstantSDNode>(N1);
- auto *AndC = dyn_cast<ConstantSDNode>(N0.getOperand(1));
- if (!ShiftC || !AndC)
- return SDValue();
-
- // If we can shrink the constant mask below 8-bits or 32-bits, then this
- // transform should reduce code size. It may also enable secondary transforms
- // from improved known-bits analysis or instruction selection.
- APInt MaskVal = AndC->getAPIntValue();
-
- // If this can be matched by a zero extend, don't optimize.
- if (MaskVal.isMask()) {
- unsigned TO = MaskVal.countTrailingOnes();
- if (TO >= 8 && isPowerOf2_32(TO))
- return SDValue();
- }
+ if(SDValue V = combineLogicalShiftWithAnd(N, DAG, DCI))
+ return V;
----------------
Add one space after `if`.
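As background on the hunk deleted above: the old guard skipped masks that instruction selection already gets for free as zero extensions (movzbl/movzwl, or the implicit zeroing of a 32-bit operation). A simplified sketch of that bail-out, with a hypothetical name and plain integers in place of the actual APInt code:

    #include <cstdint>

    // MaskVal.isMask() with countTrailingOnes() of 8/16/32 maps to
    // movzbl/movzwl/mov, and 64 is the all-ones mask that folds away
    // entirely, so reordering buys nothing in those cases.
    static bool matchableByZeroExtend(uint64_t Mask) {
      return Mask == 0xff || Mask == 0xffff || Mask == 0xffffffff ||
             Mask == ~0ULL;
    }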
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:49129
+ if(SDValue V = combineLogicalShiftWithAnd(N, DAG, DCI))
+ return V;
----------------
ditto.
================
Comment at: llvm/test/CodeGen/X86/bitreverse.ll:538
-; X64-NEXT: andb $8, %al
-; X64-NEXT: leal (%rdi,%rdi), %ecx
-; X64-NEXT: andb $4, %cl
----------------
craig.topper wrote:
> pengfei wrote:
> > IIRC, LEA is expensive, so this looks like a good deal.
> I thought LEA was only expensive if it uses 3 sources.
You are right. The SOM (software optimization manual) says only the 3-operand LEA is slow, but I found llvm-mca cannot reflect the difference.
So this is yet another regression.
old: https://godbolt.org/z/dPW93v8zT
new: https://godbolt.org/z/hrEovKW4f
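For reference, a minimal C example of the two LEA shapes under discussion; the codegen noted in the comments is what clang typically emits, not taken from this patch:

    // `a + b` fits a 2-source LEA (base + index), which is fast;
    // `a + b + 1` needs a 3-source LEA (base + index + displacement),
    // the form the optimization manuals flag as slow.
    int twoSource(int a, int b) { return a + b; }       // e.g. leal (%rdi,%rsi), %eax
    int threeSource(int a, int b) { return a + b + 1; } // e.g. leal 1(%rdi,%rsi), %eax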
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D141653/new/
https://reviews.llvm.org/D141653