[PATCH] D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0

Thu Nov 10 20:28:30 PST 2022

bcl5980 added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:8553
+  // The leaf node must be XOR
+  if (N->getOpcode() == ISD::XOR && N->hasOneUse()) {
+    WorkList.push_back(std::make_pair(N->getOperand(0), N->getOperand(1)));
----------------
I don't think the leaf node needs to be one-use. Allowing multiple uses will not increase the instruction count.

================
Comment at: llvm/test/CodeGen/AArch64/bcmp.ll:409
 ; CHECK-NEXT:    ldp x10, x11, [x1]
-; CHECK-NEXT:    ldp x12, x13, [x0, #16]
-; CHECK-NEXT:    ldp x14, x15, [x1, #16]
-; CHECK-NEXT:    eor x8, x8, x10
-; CHECK-NEXT:    eor x9, x9, x11
-; CHECK-NEXT:    ldp x16, x17, [x0, #32]
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    ldp x18, x2, [x1, #32]
-; CHECK-NEXT:    eor x12, x12, x14
-; CHECK-NEXT:    eor x13, x13, x15
-; CHECK-NEXT:    ldp x3, x0, [x0, #48]
-; CHECK-NEXT:    orr x9, x12, x13
-; CHECK-NEXT:    ldp x10, x11, [x1, #48]
-; CHECK-NEXT:    eor x14, x16, x18
-; CHECK-NEXT:    eor x15, x17, x2
-; CHECK-NEXT:    orr x12, x14, x15
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    eor x10, x3, x10
-; CHECK-NEXT:    eor x11, x0, x11
-; CHECK-NEXT:    orr x10, x10, x11
-; CHECK-NEXT:    orr x9, x12, x10
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    cmp x8, #0
+; CHECK-NEXT:    cmp x8, x10
+; CHECK-NEXT:    ccmp x9, x11, #0, eq
----------------
Allen wrote:
> bcl5980 wrote:
> > I agree that a cmp+ccmp chain is generally better, but I'm a little worried about this test case.
> > The cmp chain needs 8 cycles on every machine.
> > But 8 xor + 7 or + 1 cmp can run faster on a high-end CPU, for example a machine with 4 integer ALU ports:
> > 2 cycles for the xors
> > 3 cycles for the ors
> > 1 cycle for the cmp
> > 6 cycles in total.
> Good catch. In general, XOR, OR and CMP all use ALU ports, so the data dependency becomes the bottleneck on high-end CPUs.
> If so, do we need an additional parameter to cap the maximum number of XORs? Or is there some other suggestion?
> 
I'm also not sure whether we need a limit on the number of leaf nodes. The maximum size of an inline bcmp expansion is 64 bytes, so we don't need to worry about larger sizes either.
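
For reference, the cycle estimate quoted above can be reproduced with a rough back-of-the-envelope model. This is only a sketch under stated assumptions: every ALU op has 1-cycle latency, each level of the tree is issued as a batch on `width` ALU ports, and load latency is ignored. The function names are hypothetical and are not part of the patch.

```cpp
// Cycles to issue `ops` independent single-cycle ops on `width` ALU ports.
constexpr unsigned issueCycles(unsigned ops, unsigned width) {
  return (ops + width - 1) / width;
}

// XOR/OR tree: `pairs` independent XORs, a balanced OR reduction, one CMP.
constexpr unsigned xorOrTreeCycles(unsigned pairs, unsigned width) {
  unsigned cycles = issueCycles(pairs, width); // all XORs are independent
  for (unsigned live = pairs; live > 1; live = (live + 1) / 2)
    cycles += issueCycles(live / 2, width);    // one level of ORs
  return cycles + 1;                           // final cmp against zero
}

// CMP/CCMP chain: each ccmp consumes the flags of the previous compare,
// so the chain is fully serial regardless of how many ports exist.
constexpr unsigned ccmpChainCycles(unsigned pairs) { return pairs; }

// 64-byte bcmp with 8-byte loads gives 8 pairs to compare.
static_assert(xorOrTreeCycles(8, 4) == 6, "2 xor + 3 or + 1 cmp");
static_assert(ccmpChainCycles(8) == 8, "serial flag dependency");
static_assert(xorOrTreeCycles(8, 2) == 9, "narrower core: chain is shorter");
```

Under this model the tree only beats the chain at the 64-byte maximum on a 4-wide core; on a 2-wide core the chain is already shorter, which supports not adding a leaf-node limit unless measurements say otherwise.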

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137721/new/

https://reviews.llvm.org/D137721