[PATCH] D69891: [VP,Integer,#1] Vector-predicated integer intrinsics

Mon Nov 18 10:59:28 PST 2019

simoll marked an inline comment as done.
simoll added inline comments.

================
Comment at: llvm/docs/LangRef.rst:14702
+The '``llvm.vp.add``' intrinsic performs integer addition (:ref:`add <i_add>`) of the first and second vector operand on each enabled lane.
+The result on disabled lanes is undefined.
+
----------------
sdesmalen wrote:
> cameron.mcinally wrote:
> > sdesmalen wrote:
> > > simoll wrote:
> > > > cameron.mcinally wrote:
> > > > > simoll wrote:
> > > > > > sdesmalen wrote:
> > > > > > > simoll wrote:
> > > > > > > > sdesmalen wrote:
> > > > > > > > > jdoerfert wrote:
> > > > > > > > > > simoll wrote:
> > > > > > > > > > > sdesmalen wrote:
> > > > > > > > > > > > Have you considered adding an extra argument to specify the value of the false lanes, similar to how this is currently done for `llvm.masked.load`? 
> > > > > > > > > > > > By passing `undef`, it would have similar behaviour as the current definition, yet it would remove the need to add explicit `select` statements for handling merging- or zeroing predication if an instruction supports it. For SVE for example, most instructions have either merging or zeroing predication and the intrinsics expose the merging/zeroing/undef predication directly in the C/C++ intrinsic API. I think it would be really nice to have that capability represented in the vp intrinsics as well.
> > > > > > > > > > > Yes i considered that. However, as you suggested you can express this without loss with a  `select` and i'd like to keep the intrinsics simple.
> > > > > > > > > > FWIW, I like the idea of simple intrinsics and explicit selection.
> > > > > > > > > If we want to keep the vp intrinsics simple, shouldn't we reconsider the need for `%evl` as an explicit parameter?
> > > > > > > > > 
> > > > > > > > > That value can be folded into the predicate using a special intrinsic that enables the lanes upto `%evl`, e.g.
> > > > > > > > > 
> > > > > > > > > `<256 x i1> llvm.vp.enable.lanes(<256 x i1> %p, i32 %evl)`
> > > > > > > > > 
> > > > > > > > > If any operation needs `%evl` as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.
> > > > > > > > > If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.
> > > > > > > > 
> > > > > > > > The SX-Aurora and RiSC-V V extension natively support an explicit vector length parameter just as they support predicated vector. Mask and EVL should be treated the same: if there is a vector mask parameter there should also be an EVL parameter.
> > > > > > > I understand that certain instructions take an explicit vector length parameter, but that doesn't really explain why an arch-independent intrinsic needs to expose this. Is there any reason that modelling this using something like `llvm.vp.enable.lanes` won't work?
> > > > > > That's already two architectures that do support EVL on **all** vector instructions and it is vital to model it. EVL has performance implications (smaller EVL -> lower latency; no need to compute and store a mask where EVL can be used) . For SVE/MVE, you can simply turn the feature off by passing `i32 -1`.
> > > > > > 
> > > > > > From the perspective of RVV and SX-Aurora, saying we do not need EVL is equivalent to saying we do not need vector predication.
> > > > > > Yes i considered that. However, as you suggested you can express this without loss with a select and i'd like to keep the intrinsics simple.
> > > > > 
> > > > > Having an explicit `passthru` is a good idea. Keeping the select glued to the intrinsics may be difficult. For example:
> > > > > 
> > > > > `select %mask, %r, undef`
> > > > > 
> > > > > It would be easy for a select peep to replace this with %r, since the masked bits are undefined. But for a situation like a trapping instruction, we wouldn't want the the masked elements to execute.
> > > > > > select %mask, %r, undef
> > > > This is the current behavior - the values on masked-off lanes (by EVL or the mask) are `undef`. 
> > > > 
> > > > My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction . But, for most combinations you will have to emit an extra `select` anyway.
> > > > 
> > > > Besides, if the select is explicit and can be folded into another operation to simplify it, then that is actually a good thing.
> > > > But there is more, for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.
> > > If we'd use a `select` instruction to implement the zeroing/merging predication, the issue with the current set of intrinsics is that we don't know how many lanes to select because the predicate mask is no longer the determining factor to limit which lanes are the valid lanes. Suggesting that other architectures can just turn off the feature by setting the `%evl` parameter to `-1` doesn't change the fact that the code generator still needs to handle the IR if `%evl` is not `-1`. It would be valid IR and there is nothing that guarantees LLVM won't generate the generic intrinsics with a value that is not `-1`. This is not really special to `select`, but common to other operations that take a mask as well. For our case it means we can't implement zeroing/merging predication with this set of intrinsics using `select`.
> > > 
> > > If we want to solve the `select` issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like `@llvm.vp.enable.lanes`. Because this is an explicit intrinsic, the code-generator can simply extract the `%evl` parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like `masked.load`/`masked.gather` that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.
> > > 
> > > Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and `%evl` value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit `passthru` parameter.
> > Yes, you're right. Having an implicit undef and explicit merge is good.
> > 
> > My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction . But, for most combinations you will have to emit an extra select anyway.
> The most common passthru values in practice will be one of the source operands or zero/undef though. These are supported by SVE/SVE2 as a first-class feature (and I could imagine other/future vector architectures having special support for that as well). Other values can indeed be lowered to an explicit select instruction.
> 
> > But there is more, for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.
> At that point the intrinsic will already be lowered to individual instructions, so I don't think the definition of the intrinsic has much of an impact on that.
> Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1

Other targets do not need to consider `%evl` because there will be an `ExpandVectorPredication` pass that folds the EVL into the vector mask. Backends can request through TTI that the `%evl` be folded away and ignore that parameter from there on.

An early version of that pass is included in the [VP reference patch](https://reviews.llvm.org/D57504). The TTI changes and the expansion pass are [planned for the next patch after this one](https://reviews.llvm.org/D57504#1735416).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69891/new/

https://reviews.llvm.org/D69891