[llvm] [AArch64] Combine signext_inreg of setcc(... != splat(0)) (PR #157665)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 10 03:24:06 PDT 2025
================
@@ -26097,6 +26097,17 @@ static SDValue performSetCCPunpkCombine(SDNode *N, SelectionDAG &DAG) {
return SDValue();
}
+static bool isSignExtInReg(const SDValue &V) {
+ if (V.getOpcode() != AArch64ISD::VASHR ||
----------------
david-arm wrote:
OK, I think I understand this a bit more now. If I lower these two functions:
```
define <16 x i8> @masked_load_v16i8(ptr %src, <16 x i1> %mask) {
  %load = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr %src, i32 8, <16 x i1> %mask, <16 x i8> zeroinitializer)
  ret <16 x i8> %load
}

define <16 x i8> @masked_load_v16i8_2(ptr %src, <16 x i8> %mask) {
  %icmp = icmp ugt <16 x i8> %mask, splat (i8 3)
  %load = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr %src, i32 8, <16 x i1> %icmp, <16 x i8> zeroinitializer)
  ret <16 x i8> %load
}
```
we actually end up with decent codegen for `masked_load_v16i8_2`:
```
masked_load_v16i8:
        shl     v0.16b, v0.16b, #7
        ptrue   p0.b, vl16
        cmlt    v0.16b, v0.16b, #0
        cmpne   p0.b, p0/z, z0.b, #0
        ld1b    { z0.b }, p0/z, [x0]
        ret

masked_load_v16i8_2:
        movi    v1.16b, #3
        ptrue   p0.b, vl16
        cmphi   p0.b, p0/z, z0.b, z1.b
        ld1b    { z0.b }, p0/z, [x0]
        ret
```
so the problem is limited purely to the case where the predicate is an unknown live-in to the block. I see what you mean about the ordering of lowering for `masked_load_v16i8`, i.e. we first see:
```
Type-legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 14 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
t16: v16i8 = sign_extend_inreg t4, ValueType:ch:v16i1
t7: v16i8 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, ...
t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t16, t7
t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t9
t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1
...
Vector-legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 15 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
t21: v16i8 = AArch64ISD::VSHL t4, Constant:i32<7>
t22: v16i8 = AArch64ISD::VASHR t21, Constant:i32<7>
t7: v16i8 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, ...
t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t22, t7
t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t9
t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1
```
then
```
Legalized selection DAG: %bb.0 'masked_load_v16i8:'
SelectionDAG has 23 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t24: nxv16i1 = AArch64ISD::PTRUE TargetConstant:i32<9>
t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
t21: v16i8 = AArch64ISD::VSHL t4, Constant:i32<7>
t22: v16i8 = AArch64ISD::VASHR t21, Constant:i32<7>
t27: nxv16i8 = insert_subvector undef:nxv16i8, t22, Constant:i64<0>
t30: nxv16i1 = AArch64ISD::SETCC_MERGE_ZERO t24, t27, t28, setne:ch
t31: nxv16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t30, t28
t32: v16i8 = extract_subvector t31, Constant:i64<0>
t11: ch,glue = CopyToReg t0, Register:v16i8 $q0, t32
t28: nxv16i8 = splat_vector Constant:i32<0>
t12: ch = AArch64ISD::RET_GLUE t11, Register:v16i8 $q0, t11:1
```
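(For context: by this point the sign_extend_inreg only survives as the VSHL/VASHR pair, so a combine running after full legalization has to pattern-match that pair much as the helper in this patch does. Roughly like the following, where the helper name and exact guards are my own sketch rather than the patch itself:)
```
// Match the VSHL+VASHR pair that vector legalization emits for a
// sign_extend_inreg from vXi1: both shifts are by (lane bits - 1).
static bool isSignExtInRegFromBool(SDValue V) {
  if (V.getOpcode() != AArch64ISD::VASHR ||
      V.getOperand(0).getOpcode() != AArch64ISD::VSHL)
    return false;
  unsigned ShtAmt = V.getScalarValueSizeInBits() - 1;
  auto IsShiftBy = [ShtAmt](SDValue Op) {
    auto *C = dyn_cast<ConstantSDNode>(Op.getOperand(1));
    return C && C->getZExtValue() == ShtAmt;
  };
  return IsShiftBy(V) && IsShiftBy(V.getOperand(0));
}
```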
It feels like a shame we're expanding the sign_extend_inreg so early on. I wonder if a cleaner solution is to fold `t16: v16i8 = sign_extend_inreg t4, ValueType:ch:v16i1` and `t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t16, t7` into this:
```
t9: v16i8,ch = masked_load<(load unknown-size from %ir.src, align 8)> t0, t2, undef:i64, t4, t7
```
That would remove the extends completely and hopefully lead to better codegen too, since it would also remove the VSHL. Can we do this in the DAG combine phase that runs after `Type-legalized selection DAG: %bb.0 'masked_load_v16i8:'`? What do you think?
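For illustration, the fold I have in mind would look something like this in a masked-load combine that runs after type legalization (just a sketch: the function name is made up, and the guard would need more thought, in particular whether it's sound when the eventual predicate test inspects more than bit 0 of each lane):
```
// Sketch: fold masked_load(..., sign_extend_inreg(Mask, vXi1), ...)
//   -> masked_load(..., Mask, ...)
// so that vector legalization never creates the VSHL/VASHR pair.
static SDValue performMaskedLoadMaskCombine(
    SDNode *N, TargetLowering::DAGCombinerInfo &DCI, SelectionDAG &DAG) {
  // Only run in the combine after type legalization, before vector
  // legalization expands the sign_extend_inreg into shifts.
  if (DCI.isBeforeLegalize())
    return SDValue();

  auto *MLD = cast<MaskedLoadSDNode>(N);
  SDValue Mask = MLD->getMask();
  if (Mask.getOpcode() != ISD::SIGN_EXTEND_INREG)
    return SDValue();

  // Only strip the extend when it is from a vXi1 type, i.e. the mask
  // came from a promoted i1 vector and bit 0 carries the lane's value.
  EVT FromVT = cast<VTSDNode>(Mask.getOperand(1))->getVT();
  if (!FromVT.isVector() || FromVT.getVectorElementType() != MVT::i1)
    return SDValue();

  // Rebuild the masked load with the unextended mask (t4 in the dump
  // above), dropping the sign_extend_inreg entirely.
  return DAG.getMaskedLoad(MLD->getValueType(0), SDLoc(N), MLD->getChain(),
                           MLD->getBasePtr(), MLD->getOffset(),
                           Mask.getOperand(0), MLD->getPassThru(),
                           MLD->getMemoryVT(), MLD->getMemOperand(),
                           MLD->getAddressingMode(), MLD->getExtensionType(),
                           MLD->isExpandingLoad());
}
```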
https://github.com/llvm/llvm-project/pull/157665