[PATCH] D152205: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.
Sander de Smalen via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 23 03:15:01 PDT 2023
sdesmalen added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25671-25672
+static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
+ unsigned MinSVESize,
+ unsigned MaxSVESize,
+ ArrayRef<int> ShuffleMask,
----------------
nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the Subtarget.
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25722
+ TBLMask.push_back(DAG.getConstant(Val - NumElts, DL, MVT::i64));
+ std::swap(Op1, Op2);
+ } else {
----------------
This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you.
You can see that if you compile:
```
define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
%op0 = load <16 x i8>, ptr %ptr0
%op1 = load <16 x i8>, ptr %ptr1
; Takes only elements from second vector
%res = shufflevector <16 x i8> %op0, <16 x i8> %op1, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 31, i32 31>
ret <16 x i8> %res
}
```
Which results in:
```
t0: ch,glue = EntryToken
t4: i64,ch = CopyFromReg t0, Register:i64 %1
t8: v16i8,ch = load<(load (s128) from %ir.ptr1)> t0, t4, undef:i64
t10: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,15> t8, undef:v16i8
t12: ch,glue = CopyToReg t0, Register:v16i8 $q0, t10
t13: ch = AArch64ISD::RET_GLUE t12, Register:v16i8 $q0, t12:1
```
That means you don't need to modify the Shufflemask at all, and you can call `ShuffleVectorInst:: isSingleSourceMask(ShuffleMask)` to calculate `IsSingleOp`. If `isSingleSourceMask` returns `true`, then you can assume it will select elements from the first source operand. That will simplify this patch quite a bit!
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25868
+ // of the shufflevector, thus we are rejecting this transform.
+ (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+ MinSVESize == MaxSVESize)) {
----------------
dtemirbulatov wrote:
> sdesmalen wrote:
> > This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.
> >
> > For example:
> >
> > ```
> > define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
> > %op0 = load <256 x i8>, ptr %ptr0
> > %op1 = load <256 x i8>, ptr %ptr1
> > ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
> > %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
> > ret <16 x i8> %res
> > }
> > ```
> >
> > However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.
> > ```
> > .LCPI0_0:
> > .byte 0 // 0x0
> > .byte 1 // 0x1
> > .byte 2 // 0x2
> > .byte 3 // 0x3
> > .byte 4 // 0x4
> > .byte 5 // 0x5
> > .byte 6 // 0x6
> > .byte 7 // 0x7
> > .byte 16 // 0x10
> > .byte 33 // 0x21
> > .byte 34 // 0x22
> > .byte 35 // 0x23
> > .byte 36 // 0x24
> > .byte 37 // 0x25
> > .byte 38 // 0x26
> > .byte 39 // 0x27
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .byte 255 // 0xff
> > .text
> > .globl foo
> > .p2align 2
> > .type foo, at function
> >
> > foo: // @foo
> > .cfi_startproc
> > // %bb.0:
> > adrp x8, .LCPI0_0
> > add x8, x8, :lo12:.LCPI0_0
> > ptrue p0.b
> > ldr q1, [x1]
> > ldr q0, [x0]
> > ld1b { z2.b }, p0/z, [x8]
> > tbl z0.b, { z0.b, z1.b }, z2.b
> > // kill: def $q0 killed $q0 killed $z0
> > ret
> > ```
> I think this is correct, here is full DAG dump:
> SelectionDAG has 16 nodes:
> t0: ch,glue = EntryToken
> t2: i64,ch = CopyFromReg t0, Register:i64 %0
> t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
> t4: i64,ch = CopyFromReg t0, Register:i64 %1
> t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
> t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
> t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
> t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
> t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
> t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1
>
It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence why the offsets are valid. This clearly wasn't the right example. The following case can't be transformed like that:
```define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
%op0 = load <256 x i8>, ptr %ptr0
%op1 = load <256 x i8>, ptr %ptr1
; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
%res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 252, i32 253, i32 254, i32 255, i32 256, i32 257, i32 258, i32 259, i32 508, i32 509, i32 510, i32 511>
ret <16 x i8> %res
}
```
It's only because SelectionDAG inserts the mask into a `<256 x i8>` vector that your code happens to discard this case (because that will fail the check `MaxOffset <= NumElts * 2`). I think I'd still prefer an explicit check that the maximum index value in the ShuffleMask doesn't exceed the maximum value of the element type.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D152205/new/
https://reviews.llvm.org/D152205
More information about the llvm-commits
mailing list