[PATCH] D57504: RFC: EVL Prototype & Roadmap for vector predication in LLVM

Robin Kruppe via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 5 12:06:26 PST 2019


rkruppe added a comment.

In D57504#1384813 <https://reviews.llvm.org/D57504#1384813>, @simoll wrote:

> In D57504#1384114 <https://reviews.llvm.org/D57504#1384114>, @rkruppe wrote:
>
> > This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be `evl.gather(<1 x double*> %p, <scalable n x i1>)`, which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting `evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...)` where each mask bit controls `vscale` many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.
>
>
> Mixing vector types and scalable vector types is illegal and is not what I was suggesting. Rather, a scalar pointer would be passed to convey a consecutive load/store from a single address.


Ok, sorry, I misunderstood the proposal then. That seems reasonable, although I remain unsure about the benefits.
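
Just to confirm my new understanding, a unit-stride load would then be spelled roughly like this (sketch; the scalar-pointer overload and the mangled name are illustrative, not what the patch currently defines):

    %v = call <8 x float> @llvm.evl.gather.v8f32(float* %base, <8 x i1> %mask, i32 %vlen)

i.e. a scalar base pointer instead of a `<8 x float*>` vector of pointers, with consecutiveness implied by the scalar operand.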

>> And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.
> 
> //Less readable IR:// the address computation would become simpler, e.g. there is no need to synthesize a consecutive constant only to have it pattern-matched and subsumed in the backend (e.g. <0, 1, 2, ..., 15, 0, 1, 2, ..., 15, 0, ...>).

I don't really follow; here we were only talking about subsuming unit-stride loads under gathers (and likewise for stores/scatters), right? But anyway, I was mostly worried about a dummy 1-element vector and the extra type parameter(s) on the intrinsic, which isn't an issue with what you actually proposed.

> //Finicky pattern matching:// it is trivial to legalize this by expanding it into a more standard gather/scatter, or splitting it into consecutive memory accesses. We can even keep the EVL_LOAD, EVL_STORE SDNodes in the backend, so you wouldn't even realize that an llvm.evl.gather was used for a consecutive load at the IR level.

I'm not talking about legalizing a unit-stride access into a gather/scatter (when would that be necessary? Everyone who has scatter/gather also has unit-stride), but about recognizing that the "gather" or "scatter" is actually unit-strided. Having separate SDNodes would solve that by handling it once for all targets in SelectionDAG construction/combines.
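
Concretely, if the unit-stride access is emitted in full generality (shapes illustrative):

    %ptrs = getelementptr float, float* %base, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %v = call <8 x float> @llvm.evl.gather.v8f32(<8 x float*> %ptrs, <8 x i1> %mask, i32 %vlen)

then rediscovering the unit stride means matching %ptrs back to a splat base plus a consecutive step constant, in every place that cares.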

> //More complexity in IR passes for standard loads:// We are already using intrinsics here, and I'd rather push for a solution that makes general passes work on predicated gathers/scatters (which is not the case at the moment, as far as I know).

I expect that quite a few optimizations and analyses that work for plain old loads don't work for gathers, or have to work substantially harder on gathers, possibly with reduced effectiveness. I do want scatters and gathers to be optimized well, too, but I fear we'll instead end up with a pile of "do one thing if this `gather` is a unit-stride access, else do something different [or nothing]".
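
Store-to-load forwarding is a tiny example (illustrative):

    store <8 x float> %v, <8 x float>* %p
    %w = load <8 x float>, <8 x float>* %p    ; trivially forwarded from %v

Write the load as a gather over `%base + <0, 1, ..., 7>` instead and the same transform first has to prove that the index vector covers exactly the stored range -- and every pass grows its own variant of that check.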

How does this plan interact with the later stages of the roadmap? At stage 5, a `{Load,Store}Inst` with vlen and mask is a unit-stride access, and gathers are left out in the cold, unless you first generalize load and store instructions to general gathers and scatters (which seems a bit radical).
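
(For concreteness, I'm picturing stage 5 as something like

    %v = load <8 x double>, <8 x double>* %p, mask %m, vlen %l    ; syntax invented for illustration

with whatever spelling is eventually chosen -- the point stands either way: only the unit-stride case gets first-class instruction support.)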

> But again, we can leave this out of the first version and keep discussing it as an extension.

Yeah, this is a question of what the best canonical form for unit-stride memory accesses is; that can be debated when those actually exist in tree.



================
Comment at: include/llvm/IR/Intrinsics.td:1132
+                                LLVMMatchType<0>,
+                                LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
+                                llvm_i32_ty]>;
----------------
simoll wrote:
> rkruppe wrote:
> > simoll wrote:
> > > rkruppe wrote:
> > > > simoll wrote:
> > > > > programmerjake wrote:
> > > > > > simoll wrote:
> > > > > > > programmerjake wrote:
> > > > > > > > We will need to change the mask parameter length to allow for mask lengths that are a divisor of the main vector length.
> > > > > > > > See http://lists.llvm.org/pipermail/llvm-dev/2019-February/129845.html
> > > > > > > Can we make the vector length operate at the granularity of the mask?
> > > > > > > 
> > > > > > > In your case [1] that would mean that the AVL refers to multiples of the short element vector (e.g. `<3 x float>`).
> > > > > > > 
> > > > > > > [1] http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html
> > > > > > I was initially assuming that the vector length would be at the granularity of the mask.
> > > > > > That would work for my ISA extension. I think it would also work for the RISC-V V extension; I'd have to double-check, or get someone who's working on it to confirm. I don't think it would work on AVX512 without multiplying the vector length, assuming a shift is used to generate the mask. I have no clue for ARM SVE or other architectures.
> > > > > So we are on the same page here.
> > > > > 
> > > > > What I actually had in mind is how this would interact with scalable vectors, e.g.:
> > > > > 
> > > > >    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L) 
> > > > > 
> > > > > In that case, the vector length should refer to packets of two elements. That would be a perfect match for NEC SX-Aurora, where the AVL always refers to 64 bit elements (e.g. there is a packed float mode).
> > > > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example (in the latest public version of the spec at the moment, v0.6-draft), `<scalable 4 x i8>` is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
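> > > >
> > > > For instance (purely illustrative, following the naming style in this patch):
> > > >
> > > >    <scalable 4 x i8> evl.add(<scalable 4 x i8> %x, <scalable 4 x i8> %y, <scalable 4 x i1> %M, i32 %L)
> > > >
> > > > Here %L must count i8 elements, i.e. range over 0 .. 4 * vscale, not over 0 .. vscale.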
> > > > 
> > > > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of `vscale` rather than in terms of vector elements. `<scalable n x T>` is simply an `(n * vscale)`-element vector and that the `vscale` factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to `(n * vscale) - 1`). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every `<scalable n x T>` with `<(CONST_VSCALE * n) x T>` and every call to `llvm.vscale()` with that constant.
> > > > 
> > > > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> > > > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example (in the latest public version of the spec at the moment, v0.6-draft), <scalable 4 x i8> is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
> > > Why couldn't you use <scalable 1 x i8> then?
> > > 
> > > > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of vscale rather than in terms of vector elements. <scalable n x T> is simply an (n * vscale)-element vector and that the vscale factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to (n * vscale) - 1). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every <scalable n x T> with <(CONST_VSCALE * n) x T> and every call to llvm.vscale() with that constant.
> > > Instead, llvm.vscale() would be replaced by one constant (CONST_VSCALE) times another constant (vscale). This does not seem like a substantial difference to me.
> > > 
> > > > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> > > 
> > > You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.
> > > 
> > > Yes, lowering it otherwise may involve a shift (or adding a constant vector) in the worst case. However, all of this happens on the legalization code path, which is not expected to yield fast code but something that is correct and somewhat reasonable. We already legalize things like llvm.masked.gather on SSE (and it ain't pretty).
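> > >
> > > E.g., the generic lowering can materialize the vector length as a mask (sketch, fixed-width for simplicity; %L.splat stands for %L broadcast to all lanes, splat sequence omitted):
> > >
> > >    %vlm = icmp ult <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, %L.splat   ; lanes below the vector length
> > >    %m   = and <8 x i1> %vlm, %M                                                                   ; combine with the explicit mask
> > >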
> > > Why couldn't you use <scalable 1 x i8> then?
> > 
> > Each vector register holds a multiple of 32 bit (on that particular target), so `<scalable 4 x i8>` is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, `<scalable 4 x i8>`) versus having only half (`<scalable 2 x i8>`) or a quarter (`<scalable 1 x i8>`) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed precision code you'll want to do either depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both `vsin16s(<scalable 1 x f16>)` and `vsin16d(<scalable 2 x f16>)`.
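> >
> > To illustrate the mixed-precision point (types illustrative): a loop touching i8 and i32 data at the same vectorization factor naturally uses
> >
> >    %a = load <scalable 1 x i8>, <scalable 1 x i8>* %pa     ; VF capped by the i32 data
> >    %b = load <scalable 1 x i32>, <scalable 1 x i32>* %pb   ; lane i of %a pairs with lane i of %b
> >
> > whereas an i8-only loop would use `<scalable 4 x i8>` to fill the register.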
> > 
> > Additionally, I want to be able to actually use the full vector register without implementing a dynamically changing vscale. Not just because I'm lazy, but also because the architecture has changed enough that the motivation for it has lessened, so maybe that will not be upstreamed (or only later).
> > 
> > > Instead llvm.scale() would be replaced by a constant CONST_VSCALE times another constant: vscale. This does not seem a substantial difference to me.
> > 
> > My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant. A vlen that is in units of `vscale` is thus unnatural and clashes with every other operation on scalable vectors. If we were talking about a family of intrinsics specifically targeted at the "vector of `float4`s" use case, that would be inherent and good, but we're not.
> > 
> > It's unfortunate that this clashes with how SX-Aurora's packed operations work, I did not know that.
> > 
> > > You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.
> > 
> > Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. `float4` types). Alternatively, use a dynamic vector length of `<the length their architecture wants> * <how many elements each of the short vectors has>` and fix it up in the backend.
> > Each vector register holds a multiple of 32 bit (on that particular target), so <scalable 4 x i8> is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, <scalable 4 x i8>) versus having only half (<scalable 2 x i8>) or a quarter (<scalable 1 x i8>) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed precision code you'll want to do either depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both vsin16s(<scalable 1 x f16>) and vsin16d(<scalable 2 x f16>).
> Makes sense. However, if VL is element-grained at the IR level then there need to be functions in TTI to query the native VL grain size for the target (and per scalable type). E.g. for SX-Aurora in packed float mode the grain size is <2 x float>, so you might want to generate a remainder loop in that case (unpredicated vector body + predicated vector body for the last iteration).
> 
> 
> > My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant
> >> Quote from "[llvm-dev] [RFC] Supporting ARM's SVE in LLVM", Graham Hunter:
> >>
> >>  To represent a vector of unknown length a scaling property is added to the `VectorType` class whose element count becomes an unknown multiple of a known minimum element count
> >> <snip>
> >>  A similar rule applies to vector floating point MVTs but those types whose static component is less than 128bits (MVT::nx2f32) are also mapped directly to SVE data registers but in a form whereby elements are effectively interleaved with enough undefined elements to fulfil the 128bit requirement.
> >
> I think the sub-vector interpretation is actually the more natural reading of SVE, considering that sub-vectors are padded/interleaved to fit a native sub-register size (128bit on SVE, 64bit on SX-Aurora and 32bit on RVV (in general, or just the RVV implementation you are working on?)). Each sub-vector in the full scalable type is offset by a multiple of that size, so a scalable type is an array of padded sub-vectors.
> Makes sense. However, if VL is element-grained at the IR level then there need to be functions in TTI to query the native VL grain size for the target (and per scalable type). E.g. for SX-Aurora in packed float mode the grain size is <2 x float>, so you might want to generate a remainder loop in that case (unpredicated vector body + predicated vector body for the last iteration).

Yes, this difference affects how you should vectorize, so it needs to be in TTI in some form.

> I think the sub-vector interpretation is actually the more natural reading of SVE, considering that sub-vectors are padded/interleaved to fit a native sub-register size (128bit on SVE, 64bit on SX-Aurora and 32bit on RVV (in general, or just the RVV implementation you are working on?)). Each sub-vector in the full scalable type is offset by a multiple of that size, so a scalable type is an array of padded sub-vectors.

This behavior does not privilege or require a concept of "subvectors" (just the total number of elements/bits having a certain factor). It can be framed that way if you insist, but can just as naturally be framed as interleaving //the whole vector// or as "padding each element". The latter is an especially useful perspective for code handling mixed element types, because there it's important that each 32 bit float lines up with the corresponding elements of vectors with larger and smaller elements.

PS: for integers the "pad each element" perspective is particularly strong, because e.g. `<scalable 1 x i8>` is literally just promoted to `<scalable 1 x i32>` (on RVV, different numbers on other architectures). No undef "interleaving" with anything, just straight up sign extension.
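
For example, an add on `<scalable 1 x i8>` gets legalized roughly as (sketch of the widened form):

    %xw = sext <scalable 1 x i8> %x to <scalable 1 x i32>
    %yw = sext <scalable 1 x i8> %y to <scalable 1 x i32>
    %sw = add <scalable 1 x i32> %xw, %yw                    ; operate at the promoted width
    %s  = trunc <scalable 1 x i32> %sw to <scalable 1 x i8>

with no undef lanes interleaved anywhere.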

> 32bit on RVV (in general, or just the RVV implementation you are working on?)

It depends on the machine and target triple. At minimum it's the largest supported element type, and some environments ("platform specs" in RISC-V jargon) could demand more from supported hardware/guarantee more to software. The relevant parameters are called ELEN and VLEN in the spec.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504




