[PATCH] D57504: RFC: EVL Prototype & Roadmap for vector predication in LLVM

Simon Moll via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 5 02:22:30 PST 2019


simoll marked an inline comment as done.
simoll added inline comments.


================
Comment at: include/llvm/IR/Intrinsics.td:1132
+                                LLVMMatchType<0>,
+                                LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
+                                llvm_i32_ty]>;
----------------
rkruppe wrote:
> simoll wrote:
> > rkruppe wrote:
> > > simoll wrote:
> > > > programmerjake wrote:
> > > > > simoll wrote:
> > > > > > programmerjake wrote:
> > > > > > > We will need to change the mask parameter length to allow for mask lengths that are a divisor of the main vector length.
> > > > > > > See http://lists.llvm.org/pipermail/llvm-dev/2019-February/129845.html
> > > > > > Can we make the vector length operate at the granularity of the mask?
> > > > > > 
> > > > > > In your case [1], that would mean that the AVL refers to multiples of the short element vector (e.g. `<3 x float>`).
> > > > > > 
> > > > > > [1] http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html
> > > > > I was initially assuming that the vector length would be in the granularity of the mask.
> > > > > That would work for my ISA extension. I think that would also work for the RISC-V V extension; I would have to double-check, or get someone who's working on it to check. I don't think that would work on AVX512 without needing to multiply the vector length, assuming a shift is used to generate the mask. I have no clue for ARM SVE or other architectures.
> > > > So we are on the same page here.
> > > > 
> > > > What I actually had in mind is how this would interact with scalable vectors, e.g.:
> > > > 
> > > >    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L) 
> > > > 
> > > > In that case, the vector length should refer to packets of two elements. That would be a perfect match for NEC SX-Aurora, where the AVL always refers to 64-bit elements (e.g. there is a packed float mode).
> > > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), `<scalable 4 x i8>` is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
> > > 
> > > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of `vscale` rather than in terms of vector elements. `<scalable n x T>` is simply an `(n * vscale)`-element vector and that the `vscale` factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to `(n * vscale) - 1`). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every `<scalable n x T>` with `<(CONST_VSCALE * n) x T>` and every call to `llvm.vscale()` with that constant.
> > > 
> > > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> > > That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), <scalable 4 x i8> is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.
> > Why couldn't you use <scalable 1 x i8> then?
> > 
> > > Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of vscale rather than in terms of vector elements. <scalable n x T> is simply an (n * vscale)-element vector and that the vscale factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to (n * vscale) - 1). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every <scalable n x T> with <(CONST_VSCALE * n) x T> and every call to llvm.vscale() with that constant.
> > Instead, `llvm.vscale()` would be replaced by the constant CONST_VSCALE times another constant: vscale. That does not seem like a substantial difference to me.
> > 
> > > I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.
> > 
> > You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.
> > 
> > Yes, lowering it otherwise may involve a shift (or adding a constant vector) in the worst case. However, all of this happens on the legalization code path, which is not expected to yield fast code but something that is correct and somewhat reasonable. We already legalize things like llvm.masked.gather on SSE (and it ain't pretty).
> > Why couldn't you use <scalable 1 x i8> then?
> 
> Each vector register holds a multiple of 32 bits (on that particular target), so `<scalable 4 x i8>` is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, `<scalable 4 x i8>`) versus having only half (`<scalable 2 x i8>`) or a quarter (`<scalable 1 x i8>`) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed-precision code you'll want to do either depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both `vsin16s(<scalable 1 x f16>)` and `vsin16d(<scalable 2 x f16>)`.
> 
> Additionally, I want to be able to actually use the full vector register without implementing a dynamically changing vscale. Not just because I'm lazy, but also because the architecture has changed enough that the motivation for it has lessened, so maybe that will not be upstreamed (or only later).
> 
> > Instead llvm.scale() would be replaced by a constant CONST_VSCALE times another constant: vscale. This does not seem a substantial difference to me.
> 
> My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant. A vlen that is in units of `vscale` is thus unnatural and clashes with every other operation on scalable vectors. If we were talking about a family of intrinsics specifically targeted at the "vector of `float4`s" use case, that would be inherent and good, but we're not.
> 
> It's unfortunate that this clashes with how SX-Aurora's packed operations work, I did not know that.
> 
> > You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.
> 
> Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. `float4` types). Alternatively, use a dynamic vector length of `<the length their architecture wants> * <how many elements each of the short vectors has>` and fix it up in the backend.
> Each vector register holds a multiple of 32 bits (on that particular target), so <scalable 4 x i8> is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, <scalable 4 x i8>) versus having only half (<scalable 2 x i8>) or a quarter (<scalable 1 x i8>) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed-precision code you'll want to do either depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both vsin16s(<scalable 1 x f16>) and vsin16d(<scalable 2 x f16>).
Makes sense. However, if VL is element-grained at the IR level, then there need to be functions in TTI to query the target's native VL grain size (per scalable type). E.g. for SX-Aurora in packed float mode the grain size is <2 x float>, so you might want to generate a remainder loop in that case (unpredicated vector body + predicated vector body for the last iteration).


> My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant
>> Quote from "[llvm-dev] [RFC] Supporting ARM's SVE in LLVM", Graham Hunter:
>>
>>  To represent a vector of unknown length a scaling property is added to the `VectorType` class whose element count becomes an unknown multiple of a known minimum element count
>> <snip>
>>  A similar rule applies to vector floating point MVTs but those types whose static component is less than 128 bits (MVT::nx2f32) are also mapped directly to SVE data registers but in a form whereby elements are effectively interleaved with enough undefined elements to fulfil the 128-bit requirement.
>
I think the sub-vector interpretation is actually the more natural reading of SVE, considering that sub-vectors are padded/interleaved to fit a native sub-register size (128 bits on SVE, 64 bits on SX-Aurora, and 32 bits on RVV (in general, or just the RVV implementation you are working on?)). Each sub-vector in the full scalable type is offset by a multiple of that size, so a scalable type is an array of padded sub-vectors.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504




