[llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

Fri Nov 6 03:39:30 PST 2020

Hello Simon,

Thanks for your replies, very useful.  And yes, thanks for the example and making the target differences clear:

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;   %mask = icmp (%i + (seq <8 x i32> 0,1,2,...)), %n
  ;  %evl = i32 8

Unless I miss something, the AVX example is semantically the same as get.active.lane.mask:

   %m[i] = icmp ult (%base + i), %n

with i ranging over the 8 lanes (0 to 7).
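
To make the comparison concrete, here is a minimal Python model (not LLVM code) of the two mask computations. It assumes unsigned arithmetic without overflow and a fixed width of 8 lanes; the function names are made up for illustration only.

```python
def get_active_lane_mask(base, n, lanes):
    """Model of get.active.lane.mask: lane i is active iff base + i < n."""
    return [base + i < n for i in range(lanes)]


def avx_style_mask(i, n, lanes=8):
    """Model of the AVX idiom: compare (%i + <0,1,2,...>) against %n per lane."""
    seq = list(range(lanes))
    return [i + s < n for s in seq]


# Both forms agree at every loop position, including the tail iteration.
n = 19
for i in range(0, n, 8):
    assert get_active_lane_mask(i, n, 8) == avx_style_mask(i, n)

# Tail iteration: only elements 16, 17 and 18 are still in range.
tail_mask = get_active_lane_mask(16, n, 8)
```

In this model the only difference between the two forms is how the lane count is supplied, which is why a single interface looks plausible to me.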

I'm only saying this to see if we can have one interface for generating the mask, which is perhaps what I was expecting. If you just want an all-true mask for VE, and if we can merge AVX with the other two, we just have:

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()

I am not sure why MVE (or AVX) would need the vscale(). But if it does, I am wondering if it could be something like:

  ; RISC-V V & VE(*):
  ;   %mask = get.active.lane.mask(%i, %i)
  ;   %evl = call @llvm.vscale(256, %n - %i)
  ; MVE/SVE/AVX :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale(... ,..)

Cheers,
Sjoerd.

________________________________
From: Simon Moll <Simon.Moll at EMEA.NEC.COM>
Sent: 06 November 2020 10:07
To: Roger Ferrer Ibáñez <rofirrim at gmail.com>; Sjoerd Meijer <Sjoerd.Meijer at arm.com>
Cc: Renato Golin <rengolin at gmail.com>; Vineet Kumar <vineet.kumar at bsc.es>; LLVM Dev <llvm-dev at lists.llvm.org>; ROGER FERRER IBANEZ <roger.ferrer at bsc.es>; Arai, Masaki <arai.masaki at jp.fujitsu.com>
Subject: Re: [llvm-dev] Loop-vectorizer prototype for the EPI Project based on the RISC-V Vector Extension (Scalable vectors)

On 11/6/20 8:49 AM, Roger Ferrer Ibáñez wrote:
Hi Sjoerd,

Trying to remember how everything fits together here, but could get.active.lane.mask not create the %mask of the VP intrinsics? Or in other words, in the vectoriser, who's producing the %mask and %evl that is consumed by the VP intrinsics?

I'm not sure what would be the best way here. I am thinking of the Loop Vectorizer: I imagine at some point we can teach LV to emit VPred for the widening. VPred IR needs two additional operands, as you mentioned, %evl and %mask.

One option is to make %evl the max-vector-length of the type being operated on, and %mask (the "outer block mask" in this context) the result of get.active.lane.mask. This maps well to SVE and MVE, but not so well to VE and RISC-V (I don't think it is incorrect, but it is not efficient). Perhaps VE and RISC-V can work in this scenario if at some point they replace %evl with something like "%n - %base", computed from the operands of get.active.lane.mask, and replace %mask (the outer block mask) with a splat of "i1 1".
Basically, we would extend TTI to let the targets choose how to use the %mask and %evl operands in the VP intrinsics. So, an 'fadd' would turn into an 'llvm.vp.fadd' for all predicating targets. However, whether get.active.lane.mask() is used for %mask or whether tail predication is done with a (splat i1 1) for the mask and setting %evl would be target dependent.

Another option here is to make "%n - %base" the %evl (or at least an operand of some target hook, because computing the %evl is target-specific; targets without evl could compute the identity here) and %mask (the outer block mask) a splat of "i1 1". This maps well to VE and RISC-V but makes life harder for AVX-512, SVE and MVE (in general, any target where TargetTransformInfo::hasActiveVectorLength returns false). Those targets could replace the %evl with the max-vector-length of the operated type and then use get.active.lane.mask(0, %evl) as the outer block mask. My understanding is that Simon used this approach in https://reviews.llvm.org/D78203 but in a more general setting that is independent of what the Loop Vectorizer does.
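
The rewrite for targets without an active vector length can be modelled in a few lines of Python. This is only a sketch of the idea, not what D78203 actually implements: it assumes a lane is active iff its mask bit is set and its index is below %evl, it uses a fixed width of 8 lanes, and the helper names are hypothetical.

```python
def vp_active_lanes(mask, evl):
    """Lane i of a VP operation is active iff mask[i] is set and i < evl."""
    return [m and i < evl for i, m in enumerate(mask)]


def fold_evl_into_mask(mask, evl, max_lanes):
    """Rewrite (mask, evl) for a target without an active vector length.

    The old %evl becomes a lane mask, get.active.lane.mask(0, evl), which is
    ANDed into the block mask; the new %evl is the full vector length.
    """
    evl_mask = [i < evl for i in range(max_lanes)]
    new_mask = [m and e for m, e in zip(mask, evl_mask)]
    return new_mask, max_lanes


MAX_LANES = 8  # assumed fixed vector width for this example
mask = [True, False, True, True, True, True, True, True]
evl = 5

new_mask, new_evl = fold_evl_into_mask(mask, evl, MAX_LANES)
before = vp_active_lanes(mask, evl)
after = vp_active_lanes(new_mask, new_evl)
```

Under these assumptions the set of active lanes is unchanged by the rewrite, so the predicated operation computes the same result either way.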

For VE, we set %evl = min(max_vector_width, %n - %base). That's the same idiom that the non-LLVM NEC compilers emit for tail predication.
Basically, the LV flow could look something like this:

  ; Call the target hook to let the target select %mask and %evl params for the loop header
  %evl, %mask <- IRBuilder.createIterationPredicate(%i, %n, TTI)

  ; Some examples:
  ; RISC-V V & VE(*):
  ;   %mask = (splat i1 1)
  ;   %evl = min(256, %n - %i)
  ; MVE/SVE :
  ;   %mask = get.active.lane.mask(%i, %n)
  ;   %evl = call @llvm.vscale()
  ; AVX:
  ;   %mask = icmp (%i + (seq <8 x i32> 0,1,2,...)), %n
  ;  %evl = i32 8

  ; Configure the Vector Predication builder to use those
  VPBuilder
      .setExplicitVectorLength(%evl)
      .setMask(%mask);

  ; Start building vector-predicated instructions
  VPBuilder.createFadd(%x, %y)    ; --> call @llvm.vp.fadd(%x, %y, %mask, %evl)

It looks to me like the second option makes more effective use of vpred, and D78203 shows that we can always soften vpred into a shape that is reasonable to lower on targets without an active vector length.
The whole point about VP is to make sure there is one set of vector-predicated instructions/intrinsics that everybody is using while giving people the freedom to use these as it fits their targets. We can then concentrate on optimizing VP intrinsic code and all targets benefit.
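
As a rough illustration of the flow above, here is a small Python model of a strip-mined vector add. It is a sketch under stated assumptions, not LLVM code: create_iteration_predicate stands in for the hypothetical target hook, the "evl" and "mask" target styles and the width of 4 are invented for the example, and inactive lanes are modelled as None.

```python
def create_iteration_predicate(i, n, target, width):
    """Hypothetical stand-in for the target hook: returns (evl, mask)."""
    if target == "evl":   # RISC-V V / VE style: all-true mask, tail handled by %evl
        return min(width, n - i), [True] * width
    if target == "mask":  # MVE/SVE/AVX style: full %evl, tail handled by the mask
        return width, [i + lane < n for lane in range(width)]
    raise ValueError(f"unknown target style: {target}")


def vp_fadd(x, y, mask, evl):
    """Model of a vector-predicated fadd: inactive lanes are returned as None."""
    return [a + b if (m and lane < evl) else None
            for lane, (a, b, m) in enumerate(zip(x, y, mask))]


def vector_add(xs, ys, target, width=4):
    n = len(xs)
    out = [0.0] * n
    for i in range(0, n, width):
        evl, mask = create_iteration_predicate(i, n, target, width)
        # Pad the loads so the model can always read `width` lanes.
        x = (xs[i:i + width] + [0.0] * width)[:width]
        y = (ys[i:i + width] + [0.0] * width)[:width]
        for lane, v in enumerate(vp_fadd(x, y, mask, evl)):
            if v is not None:
                out[i + lane] = v  # predicated store: only active lanes are written
    return out


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
result_evl = vector_add(xs, ys, "evl")
result_mask = vector_add(xs, ys, "mask")
```

Both target styles go through the same vp_fadd model and produce the same sums; only the way the tail iteration is predicated differs.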

- Simon

*: VE's packed mode (512 x 32bit elements) is a use case for a non-trivial setting of %mask and %evl at the same time (%evl for packs of two 32bit elements, i.e. %evl must be even for 32bit lanes; %mask for masking out inside packages).

Thoughts?

Kind regards,
--
Roger Ferrer Ibáñez
