[PATCH] D70542: [AArch64][SVE] Add intrinsics for gather loads with 64-bit offsets

Sander de Smalen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Nov 26 07:11:51 PST 2019


sdesmalen added inline comments.


================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll:15
+; CHECK-NEXT: ret
+  %load = call <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1> %pg,
+                                                                     i8* %base,
----------------
efriedma wrote:
> andwar wrote:
> > efriedma wrote:
> > > This doesn't match the way the corresponding C intrinsics are defined in the ACLE spec.  Are you intentionally diverging here?
> > Thank you for taking a look @efriedma! Yes, this is intentional. Well spotted!
> > 
> > If we used `<nxv2i64>` as the return type here then we wouldn't need the `zext`. However, we'd need some other way to differentiate between `ld1b` and `ld1sb` later, which would basically double the number of intrinsics. We felt that leaving the sign/zero-extend here (to be folded away by a simple DAGCombine) is a good compromise. I will be upstreaming that code shortly.
> > 
> > I should also point out the we have more intrinsics like this to upstream - this patch covers only 2 addressing modes.
> My only concern with this is that it means we're committing to a specific layout for these types: specifically, that converting from `<vscale x 2 x i8>` to `<vscale x 2 x i64>` using an ANY_EXTEND is a no-op.  (Otherwise, we would be forced to generate some very nasty code to lower a load that isn't immediately extended.)
> 
> For other targets, we've found that doing legalization by promoting the integer type isn't the best approach.  For example, on x86, we used to legalize `<2 x i16>` by promoting it to `<2 x i64>`.  But that was changed to widen it to `<8 x i16>`, where the padding is all at the end, because the operations were generally cheaper.
> 
> Maybe we could implement some hybrid approach to legalization, though, that widens `<vscale x 2 x i8>` to `<vscale x 16 x i8>`, but interleaves the padding so the conversion to `<vscale x 2 x i64>` is just a bitcast.
Using the unpacked layout for scalable vectors is a conscious decision for SVE. The most important reason for that is the ability to generate a predicate; there is no `ptrue` instruction to generate an all-true predicate for a packed `<vscale x 8 x i16>` vector that sets the bottom `vscale x 8` lanes to `true`. That means we'd need to evaluate a whole compare expression to create the mask. Because a predicate must have a consistent meaning across all of its uses, we'd also need to add instructions to unpack the predicate in order for it to have the same meaning across operations that work on different element sizes. Other operations, like extracting the bottom/high half of a vector, would also become more expensive since we could no longer use `zip1`/`zip2` instructions for that.

Unlike on other targets, SVE's full predication makes it much less likely that operating on unpacked vectors is more expensive, so this seemed like the right design choice.

It's worth pointing out that this requirement is specific to scalable vectors and does not necessarily translate to fixed-width vectors like `<2 x i32>` mapped to SVE, because `ptrue` has patterns to generate a mask for an explicit vector length (although that would still require multiple predicates if the loop uses different element sizes).
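
To make the earlier point about the extends concrete, here is a rough sketch of how the narrow gather result is meant to be used. It is not taken from the patch itself: the declaration and the `<vscale x 2 x i64>` offset operand are assumed from the test snippet above, and the folding into `ld1b`/`ld1sb` is the DAGCombine @andwar mentions upstreaming separately.

```
declare <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1>, i8*, <vscale x 2 x i64>)

define <vscale x 2 x i64> @gather_zero_ext(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets) {
  ; Narrow gather result; the zext is expected to be folded into the load,
  ; selecting ld1b.
  %load = call <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets)
  %ext = zext <vscale x 2 x i8> %load to <vscale x 2 x i64>
  ret <vscale x 2 x i64> %ext
}

define <vscale x 2 x i64> @gather_sign_ext(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets) {
  ; Same gather, but with a sext instead; this is what distinguishes ld1sb from
  ; ld1b without needing a second set of intrinsics.
  %load = call <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets)
  %ext = sext <vscale x 2 x i8> %load to <vscale x 2 x i64>
  ret <vscale x 2 x i64> %ext
}
```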


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70542/new/

https://reviews.llvm.org/D70542