[PATCH] D136244: [AArch64] Optimize memcmp when the result is tested for [in]equality with 0

Wed Oct 26 06:53:14 PDT 2022

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19499-19500
+      (LHS.getOperand(0)->getOpcode() == ISD::XOR &&
+       LHS.getOperand(1)->getOpcode() == ISD::XOR) &&
+      LHS.getOperand(0)->hasOneUse() && LHS.getOperand(1)->hasOneUse()) {
+    SDValue XOR0 = LHS.getOperand(0);
----------------
Allen wrote:
> bcl5980 wrote:
> > Allen wrote:
> > > bcl5980 wrote:
> > > > Allen wrote:
> > > > > bcl5980 wrote:
> > > > > > LHS should be OneUse also?
> > > > > The **LHS **node itself is not used in the return value when the pattern matched, so I don't think the OneUse is  needed, correct me if I'm wrong, thanks.
> > > > for example:
> > > > ```
> > > > int use(int);
> > > > int f(int a, int b, int c, int d)
> > > > {
> > > >    int xor0 = a ^ b;
> > > >    int xor1 = c ^ d;
> > > >    int or0   = xor0 | xor1;
> > > >    if (or0 != 0)
> > > >         return use(or0);
> > > >    return a;
> > > > }
> > > > ```
> > > > or0 is not one use. So we should keep all of the xor+or patterns. 
> > > >  
> > > * thanks for your case. If the or0 used more than one, then the  xor+or patterns will be keep as we delete then when we match the pattern
> > > * base your above case, it seems works, https://alive2.llvm.org/ce/z/699vcf
> > >    Of course, this match doesn't depend on that either, and I can add oneuse of LHS if you still worry about it.
> > > 
> > It looks your case source instructions is less than dest? 
> I'm just suggesting the multi-use is allowed base alive2
> as I don't know how express the ccmp instruction, this is not accurate. 
> ok, I'll add the oneuse of LHS, as it is still not fully agreed, thanks very much.
I see. Can you add the testcase where the LHS has multiple uses:
```
define i32 @multiuse(i32 %0, i32 %1, i32 %2, i32 %3) {
  %5 = xor i32 %1, %0
  %6 = xor i32 %3, %2
  %7 = or i32 %6, %5
  %8 = icmp eq i32 %7, 0
  br i1 %8, label %11, label %9

9:                                                ; preds = %4
  %10 = tail call i32 @use(i32 %7) #2
  br label %11

11:                                               ; preds = %4, %9
  %12 = phi i32 [ %10, %9 ], [ %0, %4 ]
  ret i32 %12
}
```

We just need to make sure it doesn't increase the number of instructions.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19510
+    SDValue NZCVOp = DAG.getConstant(0, DL, MVT::i32);
+    SDValue CCmp = DAG.getNode(AArch64ISD::CCMP, DL, MVT_CC, XOR1.getOperand(0),
+                               XOR1.getOperand(1), NZCVOp, CCVal, Overflow);
----------------
bcl5980 wrote:
> Allen wrote:
> > bcl5980 wrote:
> > > I am not sure if we can just combine to ISD::SETCC ? Maybe it can combine with some other op.
> > sorry, I don't understand what is the **ISD::SETCC**, could you please show more detailedly? as I don't find it in my changes.
> The code should be simpler by combine to SetCC:
> 
> ```
>     SDValue XOR0 = LHS.getOperand(0);
>     SDValue XOR1 = LHS.getOperand(1);
>     SDValue Cmp0 = DAG.getSetCC(DL, VT, XOR0.getOperand(0), XOR0.getOperand(1),
>                                 ISD::SETNE);
>     SDValue Cmp1 = DAG.getSetCC(DL, VT, XOR1.getOperand(0), XOR1.getOperand(1),
>                                 ISD::SETNE);
>     SDValue Cmp = DAG.getNode(ISD::OR, DL, VT, Cmp0, Cmp1);
>     return DAG.getSetCC(DL, VT, Cmp, DAG.getConstant(0, DL, VT), Cond);
> ```
> But may fall into potential dead loop if somewhere has the reverse combination. 
> @dmgreen , which way do you think is better?
Hmm. It is not worth it if we are taking two steps to do what we could do in one. But there could be further DAG combiners for the setcc. I'd say this method is fine so long as we don't do it too early. If we find cases where there are missing combines, we can always add extra folds for the CMP/CCMPs.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19506
+    SDValue Cmp =
+        DAG.getNode(AArch64ISD::SUBS, DL, DAG.getVTList(TstVT, MVT::Glue),
+                    XOR0.getOperand(0), XOR0.getOperand(1));
----------------
Allen wrote:
> dmgreen wrote:
> > I'm not sure if this should be MVT::Glue or MVT::i32. It seems to be created differently in different places.
> I don't  very sure this is the accurate answer, it seems the **MVT::Glue** implicit instructions are scheduled together?
> https://lists.llvm.org/pipermail/llvm-dev/2014-June/074046.html
Glue is probably OK. Can you add this test case:
```
define i32 @eq(i128 noundef %x, i128 noundef %y) {
entry:
  %cmp3 = icmp eq i128 %x, %y
  %conv = trunc i128 %x to i64
  %conv1 = trunc i128 %y to i64
  %cmp = icmp eq i64 %conv, %conv1
  %or7 = or i1 %cmp3, %cmp
  %or = zext i1 %or7 to i32
  ret i32 %or
}
```

There may be issues with the CMP/CCMP with the scheduling of instructions that ISel will create out of the DAG, but I've not seen any happen yet.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136244/new/

https://reviews.llvm.org/D136244