[PATCH] D57504: RFC: EVL Prototype & Roadmap for vector predication in LLVM

Simon Moll via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 6 04:57:11 PST 2019


simoll added a comment.

In D57504#1385823 <https://reviews.llvm.org/D57504#1385823>, @rkruppe wrote:

> In D57504#1384813 <https://reviews.llvm.org/D57504#1384813>, @simoll wrote:
>
> > In D57504#1384114 <https://reviews.llvm.org/D57504#1384114>, @rkruppe wrote:
> >
> > > This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be `evl.gather(<1 x double*> %p, <scalable n x i1>)`, which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting `evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...)` where each mask bit controls `vscale` many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.
> >
> >
> > > Mixing vector types and scalable vector types is illegal and is not what I was suggesting. Rather, a scalar pointer would be passed to convey a consecutive load/store from a single address.
>
>
> Ok, sorry, I misunderstood the proposal then. That seems reasonable, although I remain unsure about the benefits.
>
> >> And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.
> > 
> > //Less readable IR:// the address computation would become simpler, e.g. there is no need to synthesize a consecutive index constant only to have it pattern-matched and subsumed in the backend (e.g. <0, 1, 2, ..., 15, 0, 1, 2, ..., 15, 0, ...>).
>
> I don't really follow; here we were only talking about subsuming unit-stride loads under gathers (and likewise for stores/scatters), right? But anyway, I was mostly worried about a dummy 1-element vector and the extra type parameter(s) on the intrinsic, which isn't an issue with what you actually proposed.
>
> > //Finicky pattern matching:// it is trivial to legalize this by expanding it into a more standard gather/scatter, or by splitting it into consecutive memory accesses. We can even keep the EVL_LOAD, EVL_STORE SDNodes in the backend, so you wouldn't even realize that an llvm.evl.gather was used for a consecutive load at the IR level.
>
> I'm not talking about legalizing a unit-stride access into a gather/scatter (when would that be necessary? everyone who has scatter/gather also has unit-stride), but about recognizing that the "gather" or "scatter" is actually unit-strided. Having separate SDNodes would solve that by handling it once for all targets in SelectionDAG construction/combines.
>
> > //More complexity in IR passes for standard loads:// We are already using intrinsics here, and I'd rather push for a solution that makes general passes work on predicated gather/scatter (which is not the case atm, afaik).
>
> I expect that quite a few optimizations and analyses that work for plain old loads don't work for gathers, or have to work substantially harder to work on gathers, maybe even with reduced effectiveness. I do want scatters and gathers to be optimized well, too, but I fear we'll instead end up with a pile of "do one thing if this `gather` is a unit-stride access, else do something different [or nothing]".
>
> How does this plan interact with the later stages of the roadmap? At stage 5, a `{Load,Store}Inst` with vlen and mask is a unit-stride access, and gathers are left out in the rain, unless you first generalize load and store instructions to general gathers and scatters (which seems a bit radical).
>
> > But again, we can leave this out of the first version and keep discussing as an extension.
>
> Yeah, this is a question of what's the best canonical form for unit-stride memory accesses, which can be debated when those actually exist in tree.


I was referring to the generalized gather, for example:

  evl.gather.nxv4f32(<scalable 1 x float*> %Ptr, <scalable 4 x i1> %Mask, ...)

This would load a sub-vector of <4 x float> from each element pointer in the %Ptr vector. Consecutive (unit-stride) loads just happen to be a corner case of the generalized gather. In particular, the remark about emitting verbose address-computation code only to pattern-match it away later should be read in that light; see e.g. https://lists.llvm.org/pipermail/llvm-dev/2019-February/129942.html.
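To make that concrete, here is a rough sketch of both cases. Treat the exact call syntax, the scalar-pointer overload, and the %EVL vector-length operand as illustrative assumptions, not settled syntax:

  ; Generalized gather: each of the vscale x 1 pointers in %Ptr designates a
  ; <4 x float> sub-vector, giving a <scalable 4 x float> result overall.
  %v = call <scalable 4 x float> @llvm.evl.gather.nxv4f32(
           <scalable 1 x float*> %Ptr, <scalable 4 x i1> %Mask, i32 %EVL)

  ; Unit-stride corner case: a single scalar base pointer conveys one
  ; consecutive load, so no <0, 1, 2, ...> index constant has to be
  ; synthesized and then pattern-matched away in the backend.
  %w = call <scalable 4 x float> @llvm.evl.gather.nxv4f32(
           float* %Base, <scalable 4 x i1> %Mask, i32 %EVL)

A target without native support for the generalized form could legalize the scalar-pointer variant into a plain consecutive load (or keep dedicated EVL_LOAD/EVL_STORE SDNodes, as mentioned above).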


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504
