[PATCH] D69891: [VP,Integer,#1] Vector-predicated integer intrinsics

Simon Moll via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 25 02:43:02 PST 2019


simoll marked 2 inline comments as done.
simoll added a comment.

Moved the discussion to the RFC <https://reviews.llvm.org/D57504#1758546>.



================
Comment at: llvm/docs/LangRef.rst:14702
+The '``llvm.vp.add``' intrinsic performs integer addition (:ref:`add <i_add>`) of the first and second vector operand on each enabled lane.
+The result on disabled lanes is undefined.
+
----------------
rkruppe wrote:
> SjoerdMeijer wrote:
> > simoll wrote:
> > > sdesmalen wrote:
> > > > cameron.mcinally wrote:
> > > > > sdesmalen wrote:
> > > > > > simoll wrote:
> > > > > > > cameron.mcinally wrote:
> > > > > > > > simoll wrote:
> > > > > > > > > sdesmalen wrote:
> > > > > > > > > > simoll wrote:
> > > > > > > > > > > sdesmalen wrote:
> > > > > > > > > > > > jdoerfert wrote:
> > > > > > > > > > > > > simoll wrote:
> > > > > > > > > > > > > > sdesmalen wrote:
> > > > > > > > > > > > > > > Have you considered adding an extra argument to specify the value of the false lanes, similar to how this is currently done for `llvm.masked.load`? 
> > > > > > > > > > > > > > > By passing `undef`, it would have similar behaviour as the current definition, yet it would remove the need to add explicit `select` statements for handling merging- or zeroing predication if an instruction supports it. For SVE for example, most instructions have either merging or zeroing predication and the intrinsics expose the merging/zeroing/undef predication directly in the C/C++ intrinsic API. I think it would be really nice to have that capability represented in the vp intrinsics as well.
> > > > > > > > > > > > > > Yes, I considered that. However, as you suggested, you can express this without loss with a `select`, and I'd like to keep the intrinsics simple.
> > > > > > > > > > > > > FWIW, I like the idea of simple intrinsics and explicit selection.
> > > > > > > > > > > > If we want to keep the vp intrinsics simple, shouldn't we reconsider the need for `%evl` as an explicit parameter?
> > > > > > > > > > > > 
> > > > > > > > > > > > That value can be folded into the predicate using a special intrinsic that enables the lanes upto `%evl`, e.g.
> > > > > > > > > > > > 
> > > > > > > > > > > > `<256 x i1> llvm.vp.enable.lanes(<256 x i1> %p, i32 %evl)`
> > > > > > > > > > > > 
> > > > > > > > > > > > If any operation needs `%evl` as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.
> > > > > > > > > > > > If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.
> > > > > > > > > > > 
> > > > > > > > > > > The SX-Aurora and the RISC-V V extension natively support an explicit vector length parameter, just as they support vector predication. Mask and EVL should be treated the same: if there is a vector mask parameter there should also be an EVL parameter.
> > > > > > > > > > I understand that certain instructions take an explicit vector length parameter, but that doesn't really explain why an arch-independent intrinsic needs to expose this. Is there any reason that modelling this using something like `llvm.vp.enable.lanes` won't work?
> > > > > > > > > That's already two architectures that support EVL on **all** vector instructions, and it is vital to model it. EVL has performance implications (smaller EVL -> lower latency; no need to compute and store a mask where EVL can be used). For SVE/MVE, you can simply turn the feature off by passing `i32 -1`.
> > > > > > > > > 
> > > > > > > > > From the perspective of RVV and SX-Aurora, saying we do not need EVL is equivalent to saying we do not need vector predication.
> > > > > > > > > Yes, I considered that. However, as you suggested, you can express this without loss with a select, and I'd like to keep the intrinsics simple.
> > > > > > > > 
> > > > > > > > Having an explicit `passthru` is a good idea. Keeping the select glued to the intrinsics may be difficult. For example:
> > > > > > > > 
> > > > > > > > `select %mask, %r, undef`
> > > > > > > > 
> > > > > > > > It would be easy for a select peephole to replace this with `%r`, since the masked bits are undefined. But for a situation like a trapping instruction, we wouldn't want the masked elements to execute.
> > > > > > > > > select %mask, %r, undef
> > > > > > > This is the current behavior - the values on masked-off lanes (by EVL or the mask) are `undef`. 
> > > > > > > 
> > > > > > > My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra `select` anyway.
> > > > > > > 
> > > > > > > Besides, if the select is explicit and can be folded into another operation to simplify it, then that is actually a good thing.
> > > > > > > But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.
> > > > > > If we'd use a `select` instruction to implement the zeroing/merging predication, the issue with the current set of intrinsics is that we don't know how many lanes to select, because the predicate mask alone no longer determines which lanes are valid. Suggesting that other architectures can just turn off the feature by setting the `%evl` parameter to `-1` doesn't change the fact that the code generator still needs to handle the IR if `%evl` is not `-1`. It would be valid IR, and there is nothing that guarantees LLVM won't generate the generic intrinsics with a value that is not `-1`. This is not really special to `select`, but common to other operations that take a mask as well. For our case it means we can't implement zeroing/merging predication with this set of intrinsics using `select`.
> > > > > > 
> > > > > > If we want to solve the `select` issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like `@llvm.vp.enable.lanes`. Because this is an explicit intrinsic, the code-generator can simply extract the `%evl` parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like `masked.load`/`masked.gather` that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.
> > > > > > 
> > > > > > Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and `%evl` value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit `passthru` parameter.
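> > > > > > 
> > > > > > For instance (a sketch, assuming the proposed `@llvm.vp.enable.lanes` and the `i32 -1` convention for an ineffective `%evl`):
> > > > > > 
> > > > > >   %m = call <256 x i1> @llvm.vp.enable.lanes(<256 x i1> %mask, i32 %evl)
> > > > > >   %r = call <256 x i32> @llvm.vp.add.v256i32(<256 x i32> %a, <256 x i32> %b, <256 x i1> %m, i32 -1)
> > > > > >   ;; an EVL-aware backend would pattern-match the enable.lanes call and feed %evl and %mask directly to its instructions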
> > > > > Yes, you're right. Having an implicit undef and explicit merge is good.
> > > > > 
> > > > > > My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra select anyway.
> > > > The most common passthru values in practice will be one of the source operands or zero/undef though. These are supported by SVE/SVE2 as a first-class feature (and I could imagine other/future vector architectures having special support for that as well). Other values can indeed be lowered to an explicit select instruction.
> > > > 
> > > > > But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.
> > > > At that point the intrinsic will already be lowered to individual instructions, so I don't think the definition of the intrinsic has much of an impact on that.
> > > > Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1
> > > 
> > > Other targets do not need to consider `%evl` because there will be an `ExpandVectorPredication` pass that folds the EVL into the vector mask. Backends can request through TTI that the `%evl` be folded away and ignore that parameter from there on.
> > > 
> > > An early version of that pass is included in the [VP reference patch](https://reviews.llvm.org/D57504). The TTI changes and the expansion pass are [planned for the next patch after this one](https://reviews.llvm.org/D57504#1735416).
> > I don't think this discussion reached consensus, and I missed the nuances here earlier / changed my mind on it.
> > 
> > The prior art on this also uses an explicit `passthru`, and it looks more generic than patching that up with a `select` later; you can always pass an `undef` in?
> > 
> > I also read the description of %evl on line 14664 above again, and thought the definition could be tightened a bit, i.e. specify an upper bound, also just for clarity:
> > 
> > > The explicit vector length parameter always has the type `i32` and is a signed integer value.  The explicit vector length is only effective if it is non-negative.  Results are only computed for enabled lanes.  A lane is enabled if the mask at that position is true and, if effective, where the lane position is below the explicit vector length.
> > 
> > And then started drafting something to see if I could describe the proposed behaviour:
> > 
> > The explicit vector length parameter always has the type `i32` and is a signed integer value. The explicit vector length (%evl) is only effective if it is non-negative; when that is the case, the effective vector length is used to calculate a mask, and its value satisfies:
> > 
> >   0 <= evl <= W,   where W is the vector length.
> > 
> > This creates a mask, `%EVLmask`, with all lanes `0 <= i < evl` set to `True`, and all other lanes `evl <= i < W` set to `False`.
> > 
> > A new mask `%M` is calculated with an element-wise AND from `%mask` and `%EVLmask`:
> > 
> >   M = mask AND EVLmask
> > 
> > A vector operation `<opcode>` on vectors `A` and `B` calculates:
> > 
> >   A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
> >                   {  undef otherwise
> > 
> > and if I'm not mistaken we are now discussing whether `undef` here should be `undef` or a `passthru`.
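> > 
> > As a quick worked example of the masking above (illustrative only), take `W = 8` and `evl = 5`:
> > 
> >   mask    = <1,1,0,1,1,1,0,1>
> >   EVLmask = <1,1,1,1,1,0,0,0>
> >   M       = <1,1,0,1,1,0,0,0>   ;; only lanes 0, 1, 3 and 4 produce defined results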
> Sjoerd's summary matches my understanding of the question here.
> 
> I believe so far nobody spelled out that there's actually a material difference between a `passthru` argument and a separate `select`. The mask that is readily available for the `select` (the `%mask` passed to the VP instruction) isn't the same as the "enabled lanes" computed from mask and EVL (`%M` in Sjoerd's exposition above), which is extremely undesirable to materialize on machines with a native equivalent of EVL (such as SX-Aurora and RISC-V).
> 
> In particular, if `%mask` is computed by another VP instruction, then lanes of `%mask` past the relevant EVL are undef, not false. That means a select like `select %mask, %v, zeroinitializer` does not actually make the disabled lanes zero; past EVL they are again undef (assuming `%v` and `%mask` were computed by VP instructions with the same EVL). In contrast, a passthru value would (presumably, I didn't see anyone object) apply to all disabled lanes.
> 
> I don't know how much this matters, since usually consumers of the vector will also only work up to the same EVL. But there are exceptions, such as the code pattern for vectorizing (associative or `-ffast-math`) reductions for RISC-V V. Briefly put, that pattern is to accumulate a vector of partial results (e.g. sums) in the loop, preserving lanes past the current EVL (which may be smaller than the EVL in some previous iterations) so they'll still participate in the final reduction after the loop. This can be expressed perfectly with `passthru` (see the sketch at the end of this comment).
> 
> Without `passthru`, that kind of code seems to require juggling two EVLs (the actual current EVL for this iteration, and the maximum EVL across all iterations) as well as materializing what Sjoerd called `%EVLmask` and `%M` above. This seems much harder for a backend to pattern-match and turn into acceptable code than the `select %m, vp.something(%a, %b, %m, -1), zeroinitializer` pattern.
> 
> I previously felt that `passthru` would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use `select`" is workable.
> 
> That doesn't necessarily mean we need `passthru` arguments on every VP instruction, I could also imagine adding a `vp.passthru(%v, %passthru, %mask, %evl)` intrinsic that takes from the first operand for all enabled lanes and the second for the other lanes. But I feel like once the floodgates are opened for doing merging "differently", it's kind of a toss-up whether it's simpler to create a special intrinsic for it or make it part of the general recipe for VP instructions.
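> 
> A rough sketch of that reduction pattern, using the hypothetical `vp.passthru` intrinsic just mentioned (illustrative only; neither the intrinsic nor the exact loop structure are part of this patch):
> 
>   loop:
>     %acc = phi <256 x i32> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
>     ;; ... load %v and compute %mask and %evl for this iteration ...
>     %sum = call <256 x i32> @llvm.vp.add.v256i32(<256 x i32> %acc, <256 x i32> %v, <256 x i1> %mask, i32 %evl)
>     ;; keep the previous partial sums on every disabled lane, including lanes past %evl
>     %acc.next = call <256 x i32> @llvm.vp.passthru.v256i32(<256 x i32> %sum, <256 x i32> %acc, <256 x i1> %mask, i32 %evl)
> 
> After the loop, every lane of %acc.next holds either a partial sum or the initial zero, so a full-width reduction over it is safe.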
    A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                    {  undef otherwise
That's an accurate description of `%evl` semantics.
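For reference, a minimal sketch of those semantics for a fixed `<4 x i32>` type (illustrative only; this is not the output of the planned `ExpandVectorPredication` pass, and it assumes `0 <= %evl <= 4`):

    %evl.ins   = insertelement <4 x i32> undef, i32 %evl, i32 0
    %evl.splat = shufflevector <4 x i32> %evl.ins, <4 x i32> undef, <4 x i32> zeroinitializer
    %EVLmask   = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %evl.splat
    %M         = and <4 x i1> %EVLmask, %mask
    %t         = add <4 x i32> %a, %b
    ;; lanes where %M is false are undef in the result of the corresponding @llvm.vp.add.v4i32 call
    %r         = select <4 x i1> %M, <4 x i32> %t, <4 x i32> undef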


================
Comment at: llvm/docs/LangRef.rst:14715
+
+      %r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i1> %mask, i32 %avl)
+      ;; For all lanes below %avl, %r is lane-wise equivalent to %also.r
----------------
SjoerdMeijer wrote:
> nit: perhaps more consistent: `%avl` -> `%evl`?
Yes. I'll change that to `%evl`.
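For reference, with that rename the example would read roughly as follows (a sketch; the `%also.r` definition paraphrases the intent of the surrounding LangRef text):

      %r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i1> %mask, i32 %evl)
      ;; For all lanes below %evl, %r is lane-wise equivalent to %also.r
      %t = add <4 x i32> %a, %b
      %also.r = select <4 x i1> %mask, <4 x i32> %t, <4 x i32> undef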


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69891/new/

https://reviews.llvm.org/D69891




