[PATCH] D105119: [SVE] Fix incorrect codegen when inserting vector elements into widened scalable vectors

Mon Jul 5 13:48:40 PDT 2021

sdesmalen added a comment.

In D105119#2854045 <https://reviews.llvm.org/D105119#2854045>, @efriedma wrote:

>> For <vscale x 4 x i16> (assuming a native vector width of vscale x 128 bits for this example):
>
> There are two independent questions to consider: one, what layout we're using, and two, how we achieve that layout.

To address the first point, what layout we're using, for scalable vectors my understanding is that LLVM could assume three different layouts:

- FMT1: `<e0, _, e1 _ | e2, _, e3, _ | ... >` (elements unpacked in `#vscale` containers)
- FMT2: `<e0, e1, e2, e3 | _, _, _, _ | ... >` (elements packed in lower `#elts / (sizeof(container)/sizeof(elt))` containers)
- FMT3: `<e0, e1, _, _ | e2, e3, _, _ | ... >` (elements packed in lower part of each container)

We need to decide on a single layout throughout SelectionDAG, so that generic code can implement things like widening, as needed for e.g. INSERT_VECTOR_ELT (implemented in this patch).
>From the start we worked on the premise and requirement that LLVM would consider scalable vectors to always use an unpacked layout (FMT1).

- The two most fundamental operations in legalization, concatenating and splitting, have to be implemented efficiently.
  - Common shuffle instructions (de-interleave/unpack) can be used to efficiently implement the unpacked (FMT1) case.
  - For FMT2, the index in the vector where to split/concatenate can be anywhere within the vector, which is a lot more complicated than splitting/concatenating at the middle of the vector.
  - I guess FMT3 is the most nonsensical layout, since no load/store instructions would ever load data into such a layout. Doing things like concatenating two vectors or inserting/extracting elements would get very complicated.
- There are no legacy procedure call standards that dictate how partial scalable vector operands passed in/out of functions should be laid out, so we have freedom to choose a layout that is profitable without having to worry about conversions.
  - Scalable vector architectures are likely to always have full predication, so having all elements in a vector lined up makes working with mixed-sized elements a lot cheaper (i.e. add `zext(<vscale x 2 x i8>)` to `<vscale x 2 x i64>` doesn't require shuffling to line up the elements first and the extends may come for free.
  - When implementing fixed-width vectors with SVE we implement a packed format (FMT2) to match the PCS for Neon. Because the number of lanes is a constant, we can relatively easily map those types onto scalable vectors using a predicate to control only the active lanes (i.e. using `ptrue vl<N>` where N is a known a compile-time constant). For scalable vectors, we can't cheaply generate such a predicate, since we'd have to compute it using the runtime value of vscale divided by the ratio of sizeof(VL1)/sizeof(VL2). For fixed-width vectors we accept the extra cost of the extends/truncates in user code in favour of transforming between packed/unpacked formats, although this could still be reconsidered in the future if that turns out more profitable.

> For `<vscale x 4 x i16>`, we use TypePromoteInteger in legalization, to convert it to `<vscale x 4 x i32>`. TypePromoteInteger has the same meaning whether or not a vector is scalable.
>
> Say we have `<vscale x 1 x i64>`.  You're saying we need to ensure this has a layout where we alternate between legal and illegal elements.  Possible ways to achieve that layout:
>
> 1. Make `<vscale x 1 x i64>` a legal type, and use some combination of operation legalization/isel patterns to ensure the alternation.
> 2. Use TypePromoteInteger to `<vscale x 1 x i128>`.  But then we need to figure out how to legalize `<vscale x 1 x i128>`, and we'd need a different solution for floating-point types.
> 3. Define TypeWidenVector to appends undef elements to the end, but convert between the two layouts where the difference is externally visible (calling convention code).
> 4. Define TypeWidenVector to do some sort of interleaving.  I guess widening `<vscale x N x i64>` ->  `<vscale x M x i64>`, we insert M-N undef elements every N elements?
> 5. Come up with a new LegalizeTypeAction that represents what we want to do here.
>
> This patch is proposing (4), I think?  Have we made any other changes that imply (4)?  Have the other possibilities been discussed anywhere?

Given the above reasoning that LLVM implements FMT1, this would discard option 3 and would make option 5 unnecessary since `widening` has only a single meaning. 
Option (2) is not possible for SVE, because in order to promote the type, `<vscale x 1 x i128>` must be a legal type. There are also no instructions to promote `<vscale x 1 x float>` to `<vscale x 1 x f128>`.

That leaves options (1) and (4) as the only feasible options. Given that SelectionDAG already implements widening (and already just works for most cases), we figured we'd use that instead of adding this support manually and adding lots of new patterns. This also strengthens the generic support for scalable vectors in ISel.

I'm not sure in how many places in generic SelectionDAG code the layout concretely matters, since most operations are implemented with generic ISD nodes where most of that level of detail is abstracted away and is left to the target. This just happens to be one of those cases where the layout actively matters.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105119/new/

https://reviews.llvm.org/D105119