[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

Mon Nov 9 07:03:02 PST 2020

; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
For VE, we want to do as much predication as possible through %evl and as little as possible with %mask. This has performance implications on VE and RISC-V: VE does not generate a mask from %evl; instead %evl is mapped directly to hardware, so passing the all-true mask is free.
So for VE, the %evl does all the predication and there is no reason to have anything other than a (splat i1 1) %mask here.
Okay, got it. One way to look at this is that (splat i1 1) is just a special case of get.active.lane.mask, for example get.mask(%i, 0) can trivially be expanded/lowered to a (splat i1 1). This is not terribly important, but shows that get.active.lane.mask could be used for all targets I think; we don't need many cases. And kind of similarly, vscale can be a no-op or do something.

Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 06 November 2020 15:37
To: Sjoerd Meijer <Sjoerd.Meijer at arm.com>; Roger Ferrer Ibáñez <rofirrim at gmail.com>
Cc: Renato Golin <rengolin at gmail.com>; Vineet Kumar <vineet.kumar at bsc.es>; LLVM Dev <llvm-dev at lists.llvm.org>; ROGER FERRER IBANEZ <roger.ferrer at bsc.es>; Arai, Masaki <arai.masaki at jp.fujitsu.com>
Subject: Re: [llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

On 11/6/20 12:39 PM, Sjoerd Meijer wrote:
Hello Simon,

Thanks for your replies, very useful.  And yes, thanks for the example and making the target differences clear:

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;  %mask = icmp (%i + (seq <8 x i32> 0,1,2,.,)), %n,
  ;  %evl = i32 8

Unless I miss something, the AVX example is semantically the same as get.active.lane.mask:

   %m[i] = icmp ult (%base + i), %n

with i  = 8.
Correct (llvm.get.active.lane.mask.v8i1.i32).
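
To make the comparison concrete, here is a minimal scalar sketch of the per-lane formula quoted above, for a fixed width of 8 lanes. The names kLanes and activeLaneMask are made up for illustration and are not LLVM APIs; computing the sum in 64 bits is a modelling choice so that base + lane cannot wrap.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative fixed width; matches the <8 x i32> AVX example in this thread.
constexpr std::size_t kLanes = 8;

// Scalar model of  m[lane] = icmp ult (base + lane), n.
// The sum is widened to 64 bits so that base + lane cannot wrap.
std::array<bool, kLanes> activeLaneMask(std::uint32_t base, std::uint32_t n) {
  std::array<bool, kLanes> mask{};
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    // Unsigned comparison, one result bit per lane.
    mask[lane] = static_cast<std::uint64_t>(base) + lane < n;
  }
  return mask;
}
```

With base = 8 and n = 11, only lanes 0 to 2 come out active, which is the tail of an 11-element loop processed 8 lanes at a time.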

Just saying this to see if we can have "1 interface" for generating the mask (which is what I was perhaps expecting), and if you just want an all true mask for VE and if we can merge AVX with the other 2 we just have:

; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
For VE, we want to do as much predication as possible through %evl and as little as possible with %mask. This has performance implications on VE and RISC-V: VE does not generate a mask from %evl; instead %evl is mapped directly to hardware, so passing the all-true mask is free.
So for VE, the %evl does all the predication and there is no reason to have anything other than a (splat i1 1) %mask here.

On SVE/MVE you may want to use get.active.lane.mask instead and on RISC-V V, AFAIU, the %evl parameter will have to be computed by some RISC-V specific `setvl` intrinsic. Both of these are okay because VP gives you that flexibility.

I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am wondering if it could be something like:

; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = call @llvm.vscale(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale(... ,..)
The vscale is only necessary with scalable types, eg you can inactivate the %evl parameter like so:

  llvm.vp.fadd nxv4f128(%x, %y, %mask, (@llvm.vscale() * 4))

The VPIntrinsic class upstream already has the functionality to check whether the %evl parameter is inactivated in this way (VPIntrinsic::canIgnoreVectorLengthParam()).
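
As a rough runtime model of what "inactivated" means here (not the upstream implementation, which inspects the IR rather than runtime values): a lane is active only if its mask bit is set and its index is below %evl, so an %evl that covers the whole vector changes nothing. The function names and the minElts parameter are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// A lane of a vector-predicated operation is active iff its mask bit is set
// and its index is below the explicit vector length.
bool laneIsActive(bool maskBit, std::size_t lane, std::size_t evl) {
  return maskBit && lane < evl;
}

// For a scalable type <vscale x minElts x T>, the vector holds
// vscale * minElts lanes. Once evl covers all of them it has no effect.
bool evlIsInactive(std::size_t evl, std::size_t vscale, std::size_t minElts) {
  assert(minElts > 0 && "a vector type has at least one element per vscale");
  return evl >= vscale * minElts;
}
```

For the nxv4 example above, evl = vscale * 4 makes evlIsInactive return true for any vscale, and predication is then controlled by %mask alone.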

Cheers,
Sjoerd.
- Simon

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 06 November 2020 10:07
To: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Renato Golin <rengolin at gmail.com>; Vineet Kumar <vineet.kumar at bsc.es>; LLVM Dev <llvm-dev at lists.llvm.org>; ROGER FERRER IBANEZ <roger.ferrer at bsc.es>; Arai, Masaki <arai.masaki at jp.fujitsu.com>
Subject: Re: [llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

On 11/6/20 8:49 AM, Roger Ferrer Ibáñez wrote:
Hi Sjoerd,

Trying to remember how everything fits together here, but could get.active.lane.mask not create the %mask of the VP intrinsics? Or in other words, in the vectoriser, who's producing the %mask and %evl that is consumed by the VP intrinsics?

I'm not sure what would be the best way here. I am thinking about the Loop Vectorizer. I imagine at some point we can teach LV to emit VPred for the widening. VPred IR needs two additional operands, as you mentioned, %evl and %mask.

One option is to make %evl the max-vector-length of the type being operated on and %mask (that is the "outer block mask" in this context) be get.active.lane.mask. This maps well to SVE and MVE, but not so much to VE and RISC-V (I don't think it is incorrect but it is not an efficient thing to do).  Perhaps VE and RISC-V can work in this scenario if at some point they replace the %evl with something like the "%n - %base" operands of get.active.lane.mask, and %mask (the outer block mask) is replaced with a splat of "i1 1".
Basically, we would extend TTI to let the targets choose how to use the %mask and %evl operands in the VP intrinsics. So, an 'fadd' would turn into an 'llvm.vp.fadd' for all predicating targets. However, whether get.active.lane.mask() is used for %mask or whether tail predication is done with a (splat i1 1) for the mask and setting %evl would be target dependent.

Another option here is to make "%n - %base" be the %evl (or at least an operand of some target hook because "computing" the %evl is target-specific, targets without evl could compute the identity here) and %mask (the outer block mask) be a splat of "i1 1". This maps well to VE and RISC-V but makes life harder for AVX-512, SVE and MVE (in general any target where TargetTransformInfo::hasActiveVectorLength returns false). Those targets could replace the %evl with the max-vector-length of the operated type and then use get.active.lane.mask(0, %evl) as the outer block mask. My understanding is that Simon used this approach in https://reviews.llvm.org/D78203 but in a more general setting, that would be independent of what Loop Vectorizer does.
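
A small sketch of the folding step described above, for targets without an active vector length: the lanes cut off by %evl are turned off in the mask instead, after which %evl can be set to the full vector length. This is a scalar model only; foldEvlIntoMask is a made-up name and is not how D78203 is written.

```cpp
#include <cstddef>
#include <vector>

// Folds an explicit vector length into the mask: a lane stays active only if
// it was active before and its index is below evl. The caller can then use
// the full vector length as the new evl without changing the result.
std::vector<bool> foldEvlIntoMask(const std::vector<bool> &mask,
                                  std::size_t evl) {
  std::vector<bool> folded(mask.size());
  for (std::size_t lane = 0; lane < mask.size(); ++lane) {
    folded[lane] = mask[lane] && lane < evl;
  }
  return folded;
}
```

With an all-true input mask this reduces to the lane < evl comparison, i.e. the same set of lanes that get.active.lane.mask(0, %evl) would select.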

For VE, we set %evl = min(max_vector_width, %n - %base) .. that's the same idiom that the non-LLVM NEC compilers are emitting for tail predication.
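
To illustrate the idiom, a minimal sketch of the %evl values such a loop would see, assuming a maximum vector width of 256 as in the examples in this thread. kMaxVectorWidth and evlPerIteration are illustrative names only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed hardware limit; 256 matches the examples in this thread.
constexpr std::size_t kMaxVectorWidth = 256;

// Returns the sequence of evl values a tail-predicated loop over n elements
// would use, with evl = min(kMaxVectorWidth, n - base) in each iteration.
std::vector<std::size_t> evlPerIteration(std::size_t n) {
  std::vector<std::size_t> evls;
  for (std::size_t base = 0; base < n;) {
    std::size_t evl = std::min(kMaxVectorWidth, n - base);
    assert(evl > 0 && "the loop must make progress");
    evls.push_back(evl);
    base += evl;  // advance by the number of elements actually processed
  }
  return evls;
}
```

For n = 600 this gives 256, 256, 88: only the last iteration runs with a shortened vector length, and no mask is needed for the tail.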
Basically, the LV flow could look something like this:

  ; Call the target hook to let the target select %mask and %evl params for the loop header
  %evl, %mask <- IRBuilder.createIterationPredicate(%i, %n, TTI)

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;  %mask = icmp (%i + (seq <8 x i32> 0,1,2,.,)), %n,
  ;  %evl = i32 8

  ; Configure the Vector Predication builder to use those
  VPBuilder
      .setExplicitVectorLength(%evl)
      .setMask(%mask);

  ; Start building vector-predicated instructions
  VPBuilder.createFadd(%x, %y)    ; --> call @llvm.vp.fadd(%x, %y, %mask, %evl)
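
As a self-contained toy model of the hook sketched above (all names here are hypothetical stand-ins, nothing below exists in LLVM, and the real hook would emit IR rather than compute values), the target's choice only changes how the same set of active lanes is expressed:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical: how a target prefers to express tail predication.
enum class PredicationStyle { EvlOnly, MaskOnly };

struct IterationPredicate {
  std::vector<bool> mask;  // models %mask
  std::size_t evl;         // models %evl
};

// Toy version of the target hook for one vector iteration starting at i.
IterationPredicate createIterationPredicate(std::size_t i, std::size_t n,
                                            std::size_t width,
                                            PredicationStyle style) {
  assert(i <= n && "iteration must start inside the trip count");
  // Start from "everything active": all-true mask, full vector length.
  IterationPredicate p{std::vector<bool>(width, true), width};
  if (style == PredicationStyle::EvlOnly) {
    p.evl = std::min(width, n - i);  // VE / RISC-V V style
  } else {
    for (std::size_t lane = 0; lane < width; ++lane) {
      p.mask[lane] = i + lane < n;   // get.active.lane.mask style
    }
  }
  return p;
}

// Counts lanes that are enabled by both the mask and the vector length.
std::size_t countActive(const IterationPredicate &p) {
  std::size_t count = 0;
  for (std::size_t lane = 0; lane < p.mask.size(); ++lane) {
    if (p.mask[lane] && lane < p.evl) ++count;
  }
  return count;
}
```

For i = 8, n = 11 and a width of 8, both styles leave exactly three lanes active; one does it with evl = 3 and an all-true mask, the other with a full evl and a three-lane mask.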

It looks to me like the second option makes more effective use of vpred, and D78203 shows that we can always soften vpred into a shape that is reasonable for lowering in targets without active vector length.
The whole point about VP is to make sure there is one set of vector-predicated instructions/intrinsics that everybody is using while giving people the freedom to use these as it fits their targets. We can then concentrate on optimizing VP intrinsic code and all targets benefit.

- Simon

*: VE's packed mode (512 x 32bit elements) is a use case for a non-trivial setting of %mask and %evl at the same time (%evl for packs of two 32bit elements (ie %evl must be even for 32bit lanes), %mask for masking out inside packages).

Thoughts?

Kind regards,
--
Roger Ferrer Ibáñez
