[PATCH] D137721: [AArch64] Optimize more memcmp when the result is tested for [in]equality with 0

Allen zhong via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Nov 12 04:23:30 PST 2022


Allen marked 2 inline comments as done.
Allen added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:8585
+  SmallVector<std::pair<SDValue, SDValue>, 16> WorkList;
+  bool IsStrict = N->isStrictFPOpcode();
+  unsigned OpNo = IsStrict ? 1 : 0;
----------------
dmgreen wrote:
> This code doesn't handle float compares so you shouldn't need the IsStrict stuff. Maybe only call this from LowerSETCC if the Opcode is ISD::SETCC or LHS.getValueType().isInteger().
* OK, deleted the IsStrict stuff.

* the call site from performSETCCCombine is still needed, as **brcond+setcc** will be combined into **br_cc** during the **Optimized type-legalized selection** stage, which runs before the **Legalized selection DAG** stage; see the case br_on_cmp_i128_ne in file CodeGen/AArch64/i128-cmp.ll


================
Comment at: llvm/test/CodeGen/AArch64/bcmp.ll:409
 ; CHECK-NEXT:    ldp x10, x11, [x1]
-; CHECK-NEXT:    ldp x12, x13, [x0, #16]
-; CHECK-NEXT:    ldp x14, x15, [x1, #16]
-; CHECK-NEXT:    eor x8, x8, x10
-; CHECK-NEXT:    eor x9, x9, x11
-; CHECK-NEXT:    ldp x16, x17, [x0, #32]
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    ldp x18, x2, [x1, #32]
-; CHECK-NEXT:    eor x12, x12, x14
-; CHECK-NEXT:    eor x13, x13, x15
-; CHECK-NEXT:    ldp x3, x0, [x0, #48]
-; CHECK-NEXT:    orr x9, x12, x13
-; CHECK-NEXT:    ldp x10, x11, [x1, #48]
-; CHECK-NEXT:    eor x14, x16, x18
-; CHECK-NEXT:    eor x15, x17, x2
-; CHECK-NEXT:    orr x12, x14, x15
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    eor x10, x3, x10
-; CHECK-NEXT:    eor x11, x0, x11
-; CHECK-NEXT:    orr x10, x10, x11
-; CHECK-NEXT:    orr x9, x12, x10
-; CHECK-NEXT:    orr x8, x8, x9
-; CHECK-NEXT:    cmp x8, #0
+; CHECK-NEXT:    cmp x8, x10
+; CHECK-NEXT:    ccmp x9, x11, #0, eq
----------------
dmgreen wrote:
> Allen wrote:
> > bcl5980 wrote:
> > > Allen wrote:
> > > > bcl5980 wrote:
> > > > > I agree that the cmp+ccmp chain is generally better, but I'm a little worried about this test case.
> > > > > The cmp chain needs 8 cycles on every machine.
> > > > > But 8 xors + 7 ors + 1 cmp can run faster on a high-end CPU, for example a machine with 4 integer ALU ports:
> > > > > 2 cycles for the xors
> > > > > 3 cycles for the ors
> > > > > 1 cycle for the cmp
> > > > > 6 cycles in total.
> > > > Good catch.  In general, the XOR, OR and CMP all use ALU ports, so the data dependency will become the bottleneck on a high-end CPU.
> > > > If so, is an additional parameter needed to guard the max number of xors? Or is there another suggestion?
> > > > 
> > > I'm also not sure whether we need a max leaf-node limitation. The max size of a bcmp expansion is 64 bytes, so we needn't worry about larger sizes either.
> > Thanks, so I've added a max limit of 6 xors for now.
> > If we can get more scheduling-model info, we may relax the condition later.
> I can see what you mean, but I'm not sure we need to limit this case. In my experience this much of a reduction in instruction count can be good for performance on its own, even if it turns the tree into a series of dependent ccmps. We could theoretically have larger trees though, so maybe raise the limit?
> 
> Unless you have performance results that actually show it to be worse, we believe it is better on most CPUs, so I would have it perform the transform in this case. 8 fewer instructions is probably always worth 1 extra cycle of critical-path length.
Thanks, I adjusted the max limit from 6 to 16, so this case now enables the transform (I don't have a machine with 4-wide ALU ports to test this case on).
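The cycle counts quoted in the thread can be sketched as a toy cost model (a hypothetical idealized machine, not LLVM code; function names are made up for illustration):

```python
# Toy latency model for the two lowerings of a 64-byte bcmp-with-zero
# discussed above, on an idealized machine with `alu_width` integer ALU ports.
import math

def tree_cycles(num_xors, alu_width):
    # Independent XORs issue in parallel, limited only by ALU width,
    # then an OR reduction tree, then one final CMP against zero.
    xor_cycles = math.ceil(num_xors / alu_width)
    or_cycles = math.ceil(math.log2(num_xors))  # depth of the OR tree
    return xor_cycles + or_cycles + 1           # +1 for the cmp

def ccmp_chain_cycles(num_pairs):
    # cmp followed by dependent ccmps: fully serial, one cycle each.
    return num_pairs

print(tree_cycles(8, 4))     # 2 + 3 + 1 = 6 cycles on a 4-wide machine
print(ccmp_chain_cycles(8))  # 8 cycles regardless of ALU width
```

This reproduces the trade-off in the review: the xor/or tree is shorter in cycles on a wide machine (6 vs 8) but costs 16 instructions against 8 for the cmp+ccmp chain, which is why fewer instructions was judged the better default.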


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137721/new/

https://reviews.llvm.org/D137721



More information about the llvm-commits mailing list