[PATCH] D57504: RFC: Prototype & Roadmap for vector predication in LLVM

Tue Feb 4 10:03:11 PST 2020

programmerjake added a comment.

From what I recall, the plan is to implement this by using fixed-size vector types combined with VL-based ops. MVL would be the size of those vector types.

Quoting all of lkcl's email so it ends up in Phabricator:

On Tue, Feb 4, 2020 at 3:48 AM @lkcl wrote:

> In D57504#1856586 <https://reviews.llvm.org/D57504#1856586>, @simoll wrote:
>
> > In D57504#1856207 <https://reviews.llvm.org/D57504#1856207>, @andrew.w.kaylor wrote:
> >
> > > In D57504#1854330 <https://reviews.llvm.org/D57504#1854330>, @simoll wrote:
> > >
> > > > Exactly. The VE target strictly requires `VL <= MVL` or you'll get a
> > > >  hardware exception. Enforcing strict UB here means VP-users have to
> > > >  explicitly insert instructions that keep the VL within bounds. This means
> > > >  that we can optimize the VL computation code and that it can be factored
> > > >  into cost calculations, etc. With Options 2 & 3 this would happen only
> > > >  very late in the backend when most scalar optimizations are already
> > > >  done.
> > >
> > >
> > > I think I'm lost here. Which thing is VL and which is MVL in this
> > >  scenario?
> >
> >
> > VL == %evl
> >  MVL == W
> >  Sorry for the vector speak :)
>
>
> ah.  right.  that bit of information was important, simon :)   without
>  clarification, i assumed W was the "required vector length at the
>  program loop level", whoops..
>
> > I agree that, in the end, the semantics will be based solely on IR-types.
> >  However, what that semantics should look like for the `%evl > W` case
> >  depends on the way targets can handle this to make sure that whatever we
> >  specify on IR-level is at least reasonable for all targets.
>
> okaaay, riight, so the purpose of the discussion is, e.g., to work out
>  how to represent things like for-loops in the strcpy example here, is
>  that right?
>
> https://www.sigarch.org/simd-instructions-considered-harmful/
>
> so when %evl > W (i.e. %evl > MVL) is requested in RVV: each trip round
 the loop tries to *set* %evl to the remaining loop length, and the
 implementation (in hardware) very very specifically -
 unbeknownst to the programmer (and to the IR writer) - hard-limits
 %evl *to* MVL.
>
> to be clear: although the programmer *tries* to set %evl > MVL, this
>  *never happens*: %evl will *always* be actually set to <= MVL.
>
> it's quite clever.
>
> it is really really important - a critical part of the design of RVV
>  loops - that the programmer (or LLVM compiler developer in this case)
>  *not* even know or make any assumptions about what MVL will be.  some
>  hardware will actually have MVL equal to 1.  some really unbelievably
>  powerful and stupidly expensive hardware might have MVL equal to 65536
>  (yes really, 65536 wide vector ALUs) and the critical thing is, the
>  assembly code *does not care*.  it still works perfectly on both,
>  despite the fact that you have no idea, really, what value MVL is
>  going to be.
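
A minimal Python model of the loop shape described above, as a sketch only and not taken from the RVV spec. `setvl` here is a hypothetical stand-in for the hardware instruction and is assumed to do nothing more than clamp the requested length to MVL. The loop body never looks at MVL, so the result is the same whether MVL is 1 or 65536.

```python
def setvl(requested, mvl):
    """Hypothetical model of setvl: hardware grants at most MVL elements."""
    return min(requested, mvl)


def vector_add(a, b, mvl):
    """Strip-mined element-wise add; the loop itself never inspects MVL."""
    assert len(a) == len(b)
    out = []
    i = 0
    remaining = len(a)
    while remaining > 0:
        vl = setvl(remaining, mvl)  # ask for everything left, get <= MVL
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        remaining -= vl
    return out


a = list(range(10))
b = [10 * x for x in a]
# Same answer on "tiny" and "huge" hardware.
assert vector_add(a, b, mvl=1) == vector_add(a, b, mvl=65536)
```
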
>
> SimpleV is different in that you absolutely must explicitly declare,
>  as part of any assembly loops (or any other instructions), precisely
>  and exactly how large MVL is to be.  this is because it is an
>  "allocation of the number of **scalar** registers - from the *scalar*
>  regfile - to be used for the vector operation".
>
> thus, for SimpleV, we do actually need a way in LLVM to represent
>  (set) MVL, because it is quite literally an "explicit reservation of a
>  certain size and number of registers".
>
> think of it as a way to say "hey y'know these upcoming SIMD
>  instructions? yeah, we need to set them to all be of length 8 for this
>  set.  then, like, next we need to set all the upcoming SIMD
>  instructions to 16, y'ken".  actually they're not SIMD they're
>  vector-ops but you get the idea.
>
> this we do with an *extra* parameter to the SV.SETVL instruction
>  https://libre-riscv.org/simple_v_extension/appendix/#index8h1
>
> SV.SETVL a2, t4, 8 # MVL==8
>
> now, *if* we have a way to set MVL (through LLVM-IR), we can *also*
>  use that for doing saving/restoring of entire scalar register files
>  with a single instruction, as well as use it for function call
>  register stack save/restore.
>
> basically when we have control over MVL through LLVM-IR, we get a
>  "LD.MULTI" and "ST.MULTI" instruction "for free" as an accidental
>  side-benefit.
>
> SV.SETMVL #32        ; tells the hardware that vector operations are to use 32 *scalar* regs
> SV.LD a0, f0, #8     ; loads registers f0 thru f31 from the address at (a0+8)
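
To make the "LD.MULTI for free" point concrete, here is a toy Python model. It is a sketch under stated assumptions, not the SimpleV definition: the class and method names are invented, VL is assumed to equal MVL for the load, and memory is indexed in whole words rather than bytes.

```python
class ScalarRegfileModel:
    """Toy model: vectors are runs of consecutive *scalar* registers."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs
        self.mvl = 1

    def setmvl(self, mvl):
        # Explicit reservation: a vector op may span at most `mvl` registers.
        if not 1 <= mvl <= len(self.regs):
            raise ValueError("MVL exceeds the scalar register file")
        self.mvl = mvl

    def load(self, memory, base, offset, first_reg):
        # Assumption: VL == MVL, and memory is indexed in whole words.
        vl = self.mvl
        if first_reg + vl > len(self.regs):
            raise ValueError("vector runs off the end of the register file")
        for i in range(vl):
            self.regs[first_reg + i] = memory[base + offset + i]


memory = list(range(100, 200))
rf = ScalarRegfileModel()
rf.setmvl(32)                                   # models SV.SETMVL #32
rf.load(memory, base=0, offset=8, first_reg=0)  # models SV.LD a0, f0, #8
assert rf.regs[0] == 108 and rf.regs[31] == 139
```

With MVL at 32, the single load fills the whole modelled register file, which is the save/restore use case mentioned above.
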
>
> for SIMD systems such as x86 and ARM, to keep loops as simple as in
 RVV and SV you'd need an instruction which, on the last run through
 the loop, leaves %evl at the fixed SIMD width but sets up a predicate
 mask *instead*... so that, despite the SIMD operation still being 4
 (or 8, or 16) wide, the elements at the end are left alone (masked
 out)
>
> without such an instruction (one which sets up the predicate bitmask
>  as not being all 1s on the last loop) you'd have to have a sequence of
>  instructions that effectively do the same job, and those instructions
>  will, clearly, impact performance due to them being executed on each
>  and every loop.
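
A small Python sketch of the tail-mask idea described above, assuming a fixed SIMD width and a per-lane boolean predicate. `tail_mask` and `simd_add` are illustrative names, not any target's instructions. The mask is all 1s on every iteration except possibly the last.

```python
def tail_mask(remaining, width):
    """Lane i is active only if element i still exists."""
    return [i < remaining for i in range(width)]


def simd_add(a, b, width):
    """Fixed-width add; only the last iteration gets a partial mask."""
    out = list(a)  # masked-out lanes are never read or written
    for start in range(0, len(a), width):
        mask = tail_mask(len(a) - start, width)
        for lane, active in enumerate(mask):
            if active:
                out[start + lane] = a[start + lane] + b[start + lane]
    return out


assert tail_mask(3, 4) == [True, True, True, False]
assert simd_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 4) == [11, 22, 33, 44, 55]
```
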
>
> unless the above is expressly supported in a single instruction (one
 equivalent to SETVL, which sets up the predicate mask on the last
 loop), this is - i am sorry to have to use this particular phrase - a
 dog's dinner approach when compared to variable-run vectorisation, and
 it's why i keep warning that attempting to add support for
 fixed-power-of-two-%evl in this proposal is not a good idea.
>
> even if you _do_ have such an instruction (or a really really short
 sequence that's equivalent and does not impact the length of the loop
 too badly), the assembly code has to use 16 wide SIMD if you want
 high performance, but then short loops waste ALU resources; and if
 you use 4 wide SIMD to stop wasting ALU resources, you can't do
 high-performance.  you are screwed both coming and going, and,
 ultimately, have to resort to stripmining to properly solve it, and at
 that point we're *definitely* outside of the scope of this proposal
 [as i understand it].
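
The width trade-off above can be put in rough numbers. This is back-of-envelope arithmetic only, assuming every iteration occupies all `width` lanes whether or not they hold live elements; the helper names are made up for illustration.

```python
def iterations_needed(n, width):
    """Loop trips for n elements at a fixed SIMD width (ceiling division)."""
    return -(-n // width)


def lane_utilisation(n, width):
    """Fraction of ALU lanes doing useful work for an n-element loop (n > 0)."""
    return n / (iterations_needed(n, width) * width)


# A 5-element loop: 16-wide SIMD leaves most lanes idle.
assert lane_utilisation(5, 16) == 5 / 16
assert lane_utilisation(5, 4) == 5 / 8
# A 1000-element loop: 4-wide SIMD needs about four times as many trips.
assert iterations_needed(1000, 16) == 63
assert iterations_needed(1000, 4) == 250
```
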
>
> l.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504