[PATCH] D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0

Fri Nov 11 19:28:44 PST 2022

Allen marked an inline comment as done.
Allen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:8553
+  // The leaf node must be XOR
+  if (N->getOpcode() == ISD::XOR && N->hasOneUse()) {
+    WorkList.push_back(std::make_pair(N->getOperand(0), N->getOperand(1)));
----------------
bcl5980 wrote:
> I believe the leaf node needn't be one-use; that will not increase the instruction count.
Done, thanks
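
For readers following along, here is a minimal standalone sketch of the equivalence this combine relies on. It is not code from the patch: the function names equalViaXorOr and equalViaCmpChain are hypothetical, and plain C++ stands in for the SelectionDAG nodes. The OR of all XOR leaves is zero exactly when every pair of operands is equal, which is what allows an (or (xor a0, b0), (xor a1, b1), ...) == 0 tree to be rewritten as a cmp followed by ccmp instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the "OR of XORs" shape produced by bcmp expansion.
// Each XOR leaf is nonzero iff its two operands differ, so the accumulated
// OR is zero iff all pairs are equal.
bool equalViaXorOr(const uint64_t *A, const uint64_t *B, size_t N) {
  assert(A && B);
  uint64_t Acc = 0;
  for (size_t I = 0; I < N; ++I)
    Acc |= A[I] ^ B[I];
  return Acc == 0;
}

// Hypothetical helper: the compare-chain shape (cmp, then ccmp ... eq).
// Each step only stays "equal" if the previous step was equal and the
// current pair matches.
bool equalViaCmpChain(const uint64_t *A, const uint64_t *B, size_t N) {
  assert(A && B);
  bool Eq = true;
  for (size_t I = 0; I < N; ++I)
    Eq = Eq && (A[I] == B[I]);
  return Eq;
}
```

Both helpers return the same answer for any input, so the rewrite does not depend on how many other users an XOR leaf has; one-use only affects whether the XOR itself can be deleted afterwards.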

================
Comment at: llvm/test/CodeGen/AArch64/bcmp.ll:409
 ; CHECK-NEXT:    ldp x10, x11, [x1]
-; CHECK-NEXT:    ldp x12, x13, [x0, #16]
-; CHECK-NEXT:    ldp x14, x15, [x1, #16]
-; CHECK-NEXT:    eor x8, x8, x10
-; CHECK-NEXT:    eor x9, x9, x11
-; CHECK-NEXT:    ldp x16, x17, [x0, #32]
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    ldp x18, x2, [x1, #32]
-; CHECK-NEXT:    eor x12, x12, x14
-; CHECK-NEXT:    eor x13, x13, x15
-; CHECK-NEXT:    ldp x3, x0, [x0, #48]
-; CHECK-NEXT:    orr x9, x12, x13
-; CHECK-NEXT:    ldp x10, x11, [x1, #48]
-; CHECK-NEXT:    eor x14, x16, x18
-; CHECK-NEXT:    eor x15, x17, x2
-; CHECK-NEXT:    orr x12, x14, x15
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    eor x10, x3, x10
-; CHECK-NEXT:    eor x11, x0, x11
-; CHECK-NEXT:    orr x10, x10, x11
-; CHECK-NEXT:    orr x9, x12, x10
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    cmp x8, #0
+; CHECK-NEXT:    cmp x8, x10
+; CHECK-NEXT:    ccmp x9, x11, #0, eq
----------------
bcl5980 wrote:
> Allen wrote:
> > bcl5980 wrote:
> > > I agree that a cmp+ccmp chain is generally better, but I am a little worried about this test case.
> > > The cmp chain needs 8 cycles on every machine.
> > > But 8 xor + 7 or + 1 cmp can run faster on a high-end CPU, for example a machine with 4 integer ALU ports:
> > > 2 cycles for the xors
> > > 3 cycles for the ors
> > > 1 cycle for the cmp
> > > 6 cycles in total.
> > Good catch. In general, XOR, OR and CMP all use ALU ports, so the data dependency will become the bottleneck on high-end CPUs.
> > If so, do we need an additional parameter to limit the maximum number of xors? Or do you have some other suggestion?
> > 
> I'm also not sure whether we need a limit on the number of leaf nodes. The maximum size of a bcmp expansion is 64 bytes, so we needn't worry about larger sizes.
Thanks, so I have now added a maximum of 6 xors.
If we get more scheduling model information, we may relax the condition later.
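
The cycle counts quoted above can be reproduced with a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not a real scheduling model: every xor, or, cmp and ccmp is assumed to take one cycle, loads are ignored, the machine issues at most AluWidth ALU operations per cycle, and each level of the or-reduction tree finishes before the next one starts. The names chainCycles and treeCycles are hypothetical.

```cpp
#include <cassert>

// Hypothetical model of the cmp + ccmp chain: every ccmp consumes the flags
// of the previous compare, so the chain is fully serial.
unsigned chainCycles(unsigned NumPairs) {
  assert(NumPairs > 0);
  return NumPairs; // 1 cmp + (NumPairs - 1) ccmp, one per cycle
}

// Hypothetical model of the xor/or/cmp form on a machine that can issue
// AluWidth independent ALU operations per cycle (an assumption).
unsigned treeCycles(unsigned NumPairs, unsigned AluWidth) {
  assert(NumPairs > 0 && AluWidth > 0);
  auto CeilDiv = [](unsigned A, unsigned B) { return (A + B - 1) / B; };

  // All xors are independent of each other.
  unsigned Cycles = CeilDiv(NumPairs, AluWidth);

  // Pairwise or-reduction: each level halves the number of live values.
  unsigned Live = NumPairs;
  while (Live > 1) {
    unsigned Ors = Live / 2;
    Cycles += CeilDiv(Ors, AluWidth);
    Live -= Ors;
  }

  return Cycles + 1; // final cmp against zero
}
```

Under these assumptions, treeCycles(8, 4) is 6 while chainCycles(8) is 8, which matches the numbers in the quoted comment, and with 6 pairs both forms come out at 6 cycles. A narrower machine shifts the balance toward the chain, which is why real scheduling-model data would be needed to tune the limit.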

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137721/new/

https://reviews.llvm.org/D137721