[PATCH] D57504: RFC: EVL Prototype & Roadmap for vector predication in LLVM

Robin Kruppe via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 4 14:42:53 PST 2019


rkruppe added a comment.

In D57504#1380587 <https://reviews.llvm.org/D57504#1380587>, @simoll wrote:

> In D57504#1380504 <https://reviews.llvm.org/D57504#1380504>, @programmerjake wrote:
>
> > We will also need to adjust gather/scatter and possibly other load/store kinds to allow the address vector length to be a divisor of the main vector length (similar to mask vector length). I didn't check if there are intrinsics for strided load/store, those will need to be changed too, to allow, for example, storing <scalable 3 x float> to var.v in:
>
>
> .. and as a side effect evl_load/evl_store are subsumed by evl_gather/evl_scatter:
>
>   evl.load(%p, %M, %L) ==  evl.gather(<1 x double*> %p, <256 x i1>..) ==  evl.gather(double* %p, <256 x i1> %M, i32 %L)


This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be `evl.gather(<1 x double*> %p, <scalable n x i1>)`, which mixes fixed and scalable vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature, and not in a direction that seems desirable. For example, it strongly suggests permitting `evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...)` where each mask bit controls `vscale` many lanes -- quite unnatural, and not something that seems likely to ever show up in hardware.
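
Spelled out (keeping this patch's intrinsic spelling; the exact signatures are just for illustration), the forms in question would look roughly like:

  ; a load of a scalable vector as a plain evl.load: scalar pointer, element-granular mask
  %v1 = call <scalable 2 x double> @llvm.evl.load(double* %p, <scalable 2 x i1> %M, i32 %L)

  ; the same load written as the "subsuming" gather: a fixed <1 x double*> pointer vector
  ; feeding a scalable result, i.e. fixed and scalable vector sizes in one call
  %v2 = call <scalable 2 x double> @llvm.evl.gather(<1 x double*> %pv, <scalable 2 x i1> %M, i32 %L)

  ; and where that logic leads: a mask in which each bit would govern vscale lanes
  %v3 = call <scalable 2 x i32> @llvm.evl.add(<scalable 2 x i32> %a, <scalable 2 x i32> %b, <2 x i1> %m, i32 %L)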

And what for? I see technical disadvantages (less readable IR, more finicky pattern matching in the backend, more complexity in IR passes that handle loads better than general gathers) and few if any technical advantages. It buys a little conceptual simplification, but only at a level of abstraction where you don't care about uniformity.



================
Comment at: include/llvm/IR/Intrinsics.td:1132
+                                LLVMMatchType<0>,
+                                LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
+                                llvm_i32_ty]>;
----------------
simoll wrote:
> rkruppe wrote:
> > simoll wrote:
> > > programmerjake wrote:
> > > > simoll wrote:
> > > > > programmerjake wrote:
> > > > > > We will need to change the mask parameter length to allow for mask lengths that are a divisor of the main vector length.
> > > > > > See http://lists.llvm.org/pipermail/llvm-dev/2019-February/129845.html
> > > > > Can we make the vector length operate at the granularity of the mask?
> > > > > 
> > > > > In your case [1] that would mean that the AVL refers to multiples of the short element vector (eg `<3 x float>`).
> > > > > 
> > > > > [1] http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html
> > > > I was initially assuming that the vector length would be at the granularity of the mask.
> > > > That would work for my ISA extension. I think that would work for the RISC-V V extension too; I'd have to double check, or get someone who's working on it to check. I don't think that would work without needing to multiply the vector length on AVX512, assuming a shift is used to generate the mask. I have no clue about ARM SVE or other architectures.
> > > So we are on the same page here.
> > > 
> > > What I actually had in mind is how this would interact with scalable vectors, e.g.:
> > > 
> > >    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L) 
> > > 
> > > In that case, the vector length should refer to packets of two elements. That would be a perfect match for NEC SX-Aurora, where AVL always refers to 64-bit elements (e.g. there is a packed float mode).
> > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), `<scalable 4 x i8>` is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
> > 
> > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of `vscale` rather than in terms of vector elements. `<scalable n x T>` is simply an `(n * vscale)`-element vector and that the `vscale` factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to `(n * vscale) - 1`). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every `<scalable n x T>` with `<(CONST_VSCALE * n) x T>` and every call to `llvm.vscale()` with that constant.
> > 
> > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), <scalable 4 x i8> is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
> Why couldn't you use <scalable 1 x i8> then?
> 
> > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of vscale rather than in terms of vector elements. <scalable n x T> is simply an (n * vscale)-element vector and that the vscale factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to (n * vscale) - 1). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every <scalable n x T> with <(CONST_VSCALE * n) x T> and every call to llvm.vscale() with that constant.
> Instead llvm.vscale() would be replaced by a constant CONST_VSCALE times another constant: vscale. This does not seem like a substantial difference to me.
> 
> > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> 
> You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.
> 
> Yes, lowering it otherwise may involve a shift (or adding a constant vector) in the worst case. However, all of this happens on the legalization code path, which is not expected to yield fast code but something that is correct and somewhat reasonable; we already legalize things like llvm.masked.gather on SSE (and it ain't pretty).
> Why couldn't you use <scalable 1 x i8> then?

Each vector register holds a multiple of 32 bits (on that particular target), so `<scalable 4 x i8>` is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, `<scalable 4 x i8>`) and having only half (`<scalable 2 x i8>`) or a quarter (`<scalable 1 x i8>`) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed-precision code you'll want one or the other depending on how you vectorize. The distinction also matters for vector function ABIs, e.g. you might have both `vsin16s(<scalable 1 x f16>)` and `vsin16d(<scalable 2 x f16>)`.
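
To make the mixed-precision point concrete (EVL intrinsic spelling purely illustrative):

  ; loop with only i8 data: stuff the register, VF = 4 * vscale
  %r = call <scalable 4 x i8> @llvm.evl.add(<scalable 4 x i8> %x, <scalable 4 x i8> %y, <scalable 4 x i1> %M, i32 %L)

  ; loop that also has i32 data: the i32 values cap the VF at 1 * vscale, so the
  ; i8 values of the same iterations ride along as <scalable 1 x i8>
  %a = call <scalable 1 x i32> @llvm.evl.add(<scalable 1 x i32> %p, <scalable 1 x i32> %q, <scalable 1 x i1> %M1, i32 %L1)
  %b = call <scalable 1 x i8> @llvm.evl.add(<scalable 1 x i8> %s, <scalable 1 x i8> %t, <scalable 1 x i1> %M1, i32 %L1)

Both choices are deliberate, so `<scalable 1 x i8>` can't simply stand in for `<scalable 4 x i8>` everywhere.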

Additionally, I want to be able to actually use the full vector register without implementing a dynamically changing vscale. Not just because I'm lazy, but also because the architecture has changed enough that the motivation for it has weakened, so maybe that will not be upstreamed (or only later).

> Instead llvm.scale() would be replaced by a constant CONST_VSCALE times another constant: vscale. This does not seem a substantial difference to me.

My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather as ordinary vectors whose length happens to be a bit more complex than a compile-time constant. A vlen in units of `vscale` is thus unnatural and clashes with every other operation on scalable vectors. If we were talking about a family of intrinsics specifically targeted at the "vector of `float4`s" use case, that would be inherent and good, but we're not.
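
Concretely, for a `<scalable 4 x i8>` operation (again, illustrative intrinsic spelling):

  ; element-granular %L, consistent with everything else in the IR: lanes are numbered
  ; 0 .. (4 * vscale) - 1 and the first %L of them are active
  %r = call <scalable 4 x i8> @llvm.evl.add(<scalable 4 x i8> %x, <scalable 4 x i8> %y, <scalable 4 x i1> %M, i32 %L)

  ; a vscale-granular %L would instead mean "the first %L packets of 4 elements", i.e.
  ; 4 * %L active lanes -- unlike extractelement, insertelement and shufflevector,
  ; which all number individual lanes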

It's unfortunate that this clashes with how SX-Aurora's packed operations work; I did not know that.

> You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.

Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. `float4` types). Alternatively, use a dynamic vector length of `<the length their architecture wants> * <how many elements each of the short vectors has>` and fix it up in the backend.
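
For the second alternative, the fix-up could look roughly like this (using the `<scalable 3 x float>` type from Jacob's example; names are made up):

  ; the frontend knows it operates on %n short <3 x float> vectors, but the EVL
  ; intrinsics count individual elements, so scale the dynamic length by 3 up front
  %L = mul i32 %n, 3
  %r = call <scalable 3 x float> @llvm.evl.fadd(<scalable 3 x float> %x, <scalable 3 x float> %y, <scalable 3 x i1> %M, i32 %L)

  ; the backend divides the element count back down when it maps this onto a
  ; per-vec3 hardware length register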


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504




