[llvm-dev] Adding support for vscale

Tue Oct 1 06:06:34 PDT 2019

Hi Luke,

>> want to use a separate mechanism
>> that fits with their hardware having a changeable active length.
> 
> okaaay, now, yes, i Get It.  this is MVL (Max Vector Length) in RVV.
> btw minor nitpick: it's not that "their" hardware changes, it's that
> the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
> multiple vendors each choosing an arbitrary MVL suited to their
> customer's needs).  "RVV-compliant hardware" would fit things better.

Yes, the hardware doesn't change; a dynamic/active VL just stops it
processing elements past the number of active elements.

SVE similarly allows vendors to choose a maximum hardware vector length
but skips an active VL in favour of predication only.

I'll try to clear things up with a concrete example for SVE.

Allowable SVE hardware vector lengths are all multiples of 128 bits. So
our main legal types for codegen will have a minimum size of 128 bits,
e.g. <vscale x 4 x i32>.

If a low-end device implements SVE at 128 bits, then at runtime vscale is
1 and you get exactly <4 x i32>.

For mid-level devices I'd guess 256 bits is reasonable, so vscale would be
2 and <vscale x 4 x i32> would be equivalent to <8 x i32>, but we still
only guarantee that the first 4 lanes exist.

For Fujitsu's A64FX at 512 bits, vscale is 4 and the legal type would now
be equivalent to <16 x i32>.
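
To put numbers on the three cases above, here is a small Python sketch of
the arithmetic. It is illustration only: `vscale_for` and
`fixed_equivalent` are made-up names rather than LLVM APIs, and it assumes
the hardware vector length is a multiple of 128 bits as described above.

```python
SVE_GRANULE_BITS = 128  # SVE vector lengths are multiples of 128 bits


def vscale_for(hw_vector_bits):
    """Return the runtime vscale for a given hardware vector length."""
    if hw_vector_bits <= 0 or hw_vector_bits % SVE_GRANULE_BITS != 0:
        raise ValueError("vector length must be a positive multiple of 128 bits")
    return hw_vector_bits // SVE_GRANULE_BITS


def fixed_equivalent(min_lanes, elt_bits, hw_vector_bits):
    """Lane count that <vscale x min_lanes x iN> has on this hardware."""
    # This sketch only handles types whose minimum size is one 128-bit granule.
    assert min_lanes * elt_bits == SVE_GRANULE_BITS
    return vscale_for(hw_vector_bits) * min_lanes


# <vscale x 4 x i32> on 128-, 256- and 512-bit implementations.
for bits in (128, 256, 512):
    print(bits, vscale_for(bits), fixed_equivalent(4, 32, bits))
```

The loop prints vscale values 1, 2 and 4 with lane counts 4, 8 and 16,
matching the <4 x i32>, <8 x i32> and <16 x i32> equivalents above.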

In all cases, vscale is constant at runtime for those machines. While it
is possible to change the maximum vector length from privileged code (so
you could set the A64FX to run with 256b or 128b vectors if you chose...
even 384b if you wanted to), we don't allow for changes at runtime since
that may corrupt data. Expecting the compiler to be able to recover from
a change in vector length when you have spilled registers to the stack
isn't reasonable.

Robin found a way to make this work for RVV; there, he had the additional
concern of registers being joined together in x2, x4, (x8?) combinations.
This was resolved by just making the legal types bigger when that feature
is in use, iirc.

Would that approach help SV, or is it just a backend thing deciding how
many scalar registers it can spare?

>> The scalable type tells you the maximum number of elements that could be
>> operated on,
> 
> ... which is related (in RVV) to MVL...
> 
>> and individual operations can constrain that to a smaller
>> number of elements.
> 
> ... by setting VL.

Yes, at least for architectures that support changing VL. Simon's
proposal was to provide intrinsics for common IR operations which
took an additional parameter corresponding to VL; vscale doesn't
represent VL, so doesn't need to change.

>>> hmmmm.  so it looks like data-dependent fail-on-first is something
>>> that's going to come up later, rather than right now.
>> 
>> Arm's downstream compiler has been able to use the scalable type and a
>> constant vscale with first-faulting loads for around 4 years, so there's
>> no conflict here.
> 
> ARM's SVE uses predication. the LDs that would [normally] cause
> page-faults create a mask, instead, giving *only* those LDs which
> "succeeded".

Those that succeeded until the first that didn't -- every bit in the
mask after a fault is unset, even if that lane's access would have
succeeded on its own, as can happen with a first-faulting gather
operation.
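
As a scalar model of that mask behaviour, in Python purely for
illustration: `first_fault_mask` is a hypothetical name, and the per-lane
"would succeed" flags are an assumed input for the sketch, not something
the hardware exposes. It only models the resulting mask, not how a fault
on the first active lane is reported.

```python
def first_fault_mask(lane_ok):
    """Model the predicate produced by a first-faulting load.

    lane_ok[i] says whether lane i's access would succeed in isolation.
    Lanes are active up to, but not including, the first failing lane;
    every lane after that is inactive, even if it would have succeeded.
    """
    mask = []
    faulted = False
    for ok in lane_ok:
        faulted = faulted or not ok
        mask.append(not faulted)
    return mask
```

For example, `first_fault_mask([True, True, False, True])` gives
`[True, True, False, False]`: the last lane is masked off although its
own access would have been fine.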

> that's then passed into standard [SIMD-based] predicated operations,
> masking out operations, the most important one (for the canonical
> strcpy / memcpy) being the ST.

Nod; I wrote an experimental early-exit loop vectorizer which made use of that.

>> We will need to figure out exactly what form the first faulting intrinsics
>> take of course, as I think SVE's predication-only approach doesn't quite
>> fit with others -- maybe we'll end up with two intrinsics?
> 
> perhaps - as robin suggests, this for another discussion (not related
> to vscale).
> 
>> Or maybe we'll
>> be able to synthesize a predicate from an active vlen and pattern match?
> 
> the purpose of having a dynamic VL, which comes originally from the
> Cray Supercomputer Vector Architecture, is to not have to use the
> kinds of instructions that perform bit-manipulation
> (mask-manipulation) which are not only wasted CPU cycles, but end up
> in many [simpler] hardware implementations with masked-out "Lanes"
> running empty, particularly ones that have Vector Front-ends but
> predicated-SIMD-style ALU backends.

Yeah, I get that, which is why I support Simon's proposals.

> i would be quite concerned, therefore, if by "synthesise a predicate"
> the idea was, instead of using actual dynamic truncation of vlen
> (changing vscale), instructions were used to create a predicate which
> had its last bits set to zero.
> 
> basically using RVV/SV fail-on-first to emulate the way that ARM SVE
> fail-on-first creates masks.
> 
> that would be... yuk :)

Ah, I could have made that a bit clearer. I meant having a first-faulting
load intrinsic which returns a vector and an integer representing the
number of valid lanes. For architectures using a dynamic VL, you could
then pass that integer to subsequent operations so they are tied to
that number of active elements.
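
Roughly, in Python as a scalar model (a sketch only, not a proposed
interface): `ff_load` and `vl_add` are hypothetical names, memory is
modelled as a dict where a missing address stands in for a fault, and
having inactive lanes keep the first operand's value is an assumption
made just for this example.

```python
def ff_load(memory, base, lanes):
    """Hypothetical first-faulting load: returns (vector, valid_lanes).

    `memory` maps address -> value; a missing address models a fault.
    Lanes at or after the first fault are left as 0 and are not counted.
    """
    vec = [0] * lanes
    valid = 0
    for i in range(lanes):
        addr = base + i
        if addr not in memory:
            break  # first fault: stop, later lanes are not attempted
        vec[i] = memory[addr]
        valid += 1
    return vec, valid


def vl_add(a, b, vl):
    """Hypothetical VL-taking add: only the first `vl` lanes are processed.

    Assumption for this sketch: lanes at or beyond `vl` keep a's value.
    """
    return [a[i] + b[i] if i < vl else a[i] for i in range(len(a))]


memory = {0: 10, 1: 20, 2: 30}       # address 3 is unmapped
vec, n = ff_load(memory, 0, 4)       # n is the number of valid lanes
result = vl_add(vec, [1, 1, 1, 1], n)
```

Here `n` comes back as 3, and the add touches only those three lanes, so
`result` is `[11, 21, 31, 0]`.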

For SVE/AVX512, we'd have to splat that integer and compare against
a stepvector to generate a mask. Ugly, but it can be pattern matched
into the direct first/no-faulting loads and masks for codegen.
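
The splat-and-compare step would look something like this as a scalar
Python model. The helper names are made up for illustration and are not
the IR spelling; lists stand in for vectors.

```python
def stepvector(lanes):
    """Model of a step vector: <0, 1, 2, ..., lanes-1>."""
    return list(range(lanes))


def splat(value, lanes):
    """Model of broadcasting a scalar to every lane."""
    return [value] * lanes


def mask_from_active_count(n_valid, lanes):
    """Lane i is active iff stepvector[i] < n_valid (unsigned compare)."""
    step = stepvector(lanes)
    limit = splat(n_valid, lanes)
    return [s < l for s, l in zip(step, limit)]
```

With 4 lanes and 3 valid elements, `mask_from_active_count(3, 4)` yields
`[True, True, True, False]`, i.e. a predicate with only its trailing
lane cleared.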

Or we just use separate intrinsics.

To discuss later, I think; possibly on the early-exit loopvec thread.

-Graham