[PATCH] D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0
chenglin.bi via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 10 20:28:30 PST 2022
bcl5980 added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:8553
+ // The leaf node must be XOR
+ if (N->getOpcode() == ISD::XOR && N->hasOneUse()) {
+ WorkList.push_back(std::make_pair(N->getOperand(0), N->getOperand(1)));
----------------
I believe the leaf node doesn't need the one-use check. Dropping it will not increase the instruction count.
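
To illustrate the suggestion, here is a toy sketch of leaf collection over an OR tree. It is not the patch's code and does not use SelectionDAG: `Node`, `Op`, `NumUses` and `collectXorLeaves` are hypothetical names, and it assumes that interior OR nodes still keep a one-use check while XOR leaves do not.

```cpp
#include <utility>
#include <vector>

// Toy stand-in for a DAG node (hypothetical, not llvm::SDNode).
enum class Op { Or, Xor, Value };

struct Node {
  Op Opcode;
  const Node *LHS = nullptr;
  const Node *RHS = nullptr;
  unsigned NumUses = 1;
};

using Leaf = std::pair<const Node *, const Node *>;

// Walks an OR tree and records the operand pair of every XOR leaf.
// Assumption: non-root OR nodes must have a single use, XOR leaves may
// have any number of uses.
bool collectXorLeaves(const Node *N, std::vector<Leaf> &WorkList,
                      bool IsRoot = true) {
  if (N->Opcode == Op::Xor) {
    // No one-use check on the leaf.
    WorkList.emplace_back(N->LHS, N->RHS);
    return true;
  }
  if (N->Opcode != Op::Or)
    return false;
  if (!IsRoot && N->NumUses != 1)
    return false;
  return collectXorLeaves(N->LHS, WorkList, false) &&
         collectXorLeaves(N->RHS, WorkList, false);
}
```

In this model a multi-use XOR leaf is still accepted. Its eor stays for the other user, but the orr and the compare against zero are replaced by the cmp/ccmp pair, so the count does not go up.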
================
Comment at: llvm/test/CodeGen/AArch64/bcmp.ll:409
; CHECK-NEXT: ldp x10, x11, [x1]
-; CHECK-NEXT: ldp x12, x13, [x0, #16]
-; CHECK-NEXT: ldp x14, x15, [x1, #16]
-; CHECK-NEXT: eor x8, x8, x10
-; CHECK-NEXT: eor x9, x9, x11
-; CHECK-NEXT: ldp x16, x17, [x0, #32]
-; CHECK-NEXT: orr x8, x8, x9
-; CHECK-NEXT: ldp x18, x2, [x1, #32]
-; CHECK-NEXT: eor x12, x12, x14
-; CHECK-NEXT: eor x13, x13, x15
-; CHECK-NEXT: ldp x3, x0, [x0, #48]
-; CHECK-NEXT: orr x9, x12, x13
-; CHECK-NEXT: ldp x10, x11, [x1, #48]
-; CHECK-NEXT: eor x14, x16, x18
-; CHECK-NEXT: eor x15, x17, x2
-; CHECK-NEXT: orr x12, x14, x15
-; CHECK-NEXT: orr x8, x8, x9
-; CHECK-NEXT: eor x10, x3, x10
-; CHECK-NEXT: eor x11, x0, x11
-; CHECK-NEXT: orr x10, x10, x11
-; CHECK-NEXT: orr x9, x12, x10
-; CHECK-NEXT: orr x8, x8, x9
-; CHECK-NEXT: cmp x8, #0
+; CHECK-NEXT: cmp x8, x10
+; CHECK-NEXT: ccmp x9, x11, #0, eq
----------------
Allen wrote:
> bcl5980 wrote:
> > I agree that a cmp+ccmp chain is generally better, but I'm a little worried about this test case.
> > The cmp chain needs 8 cycles on every machine.
> > But 8 xor + 7 or + 1 cmp can run faster on a high-end CPU, for example a machine with 4 integer ALU ports:
> > 2 cycles for the xors
> > 3 cycles for the ors
> > 1 cycle for the cmp
> > 6 cycles in total.
> Good catch. In general, XOR, OR and CMP all use ALU ports, so the data dependency becomes the bottleneck on high-end CPUs.
> If so, do we need an additional parameter to cap the maximum number of xors? Or is there some other suggestion?
>
I'm also not sure whether we need a limit on the maximum number of leaf nodes. The maximum size of a bcmp expansion is 64 bytes, so we needn't worry about larger sizes.
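
For reference, the cycle estimate quoted above can be reproduced with a back-of-the-envelope model. This is only a sketch, not a measurement and not part of the patch. It assumes every XOR/OR/CMP takes one cycle, that `Width` independent ALU ports are available, that each level of the OR tree finishes before the next starts, and it ignores load latency. The function names are hypothetical.

```cpp
// Requires C++14 or later (loops in constexpr functions).

constexpr unsigned ceilDiv(unsigned A, unsigned B) { return (A + B - 1) / B; }

// Estimated cycles for the XOR/OR reduction over `Pairs` 64-bit word pairs
// on a machine with `Width` ALU ports.
constexpr unsigned xorOrTreeCycles(unsigned Pairs, unsigned Width) {
  // All XORs are independent of each other.
  unsigned Cycles = ceilDiv(Pairs, Width);
  // Each OR level halves the number of live values.
  for (unsigned Live = Pairs; Live > 1; Live = (Live + 1) / 2)
    Cycles += ceilDiv(Live / 2, Width);
  // Final compare against zero.
  return Cycles + 1;
}

// The cmp/ccmp chain is fully serial: one cycle per pair.
constexpr unsigned cmpChainCycles(unsigned Pairs) { return Pairs; }

// 64-byte bcmp: 8 pairs of 64-bit words.
static_assert(xorOrTreeCycles(8, 4) == 6, "2 (xor) + 3 (or) + 1 (cmp)");
static_assert(cmpChainCycles(8) == 8, "8 dependent cmp/ccmp");
```

Under the same assumptions a 2-wide machine gives 9 cycles for the XOR/OR form against 8 for the chain, so which form wins depends on the ALU width.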
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D137721/new/
https://reviews.llvm.org/D137721