[llvm-dev] Adding support for vscale

Tue Oct 1 04:37:11 PDT 2019

On Tue, Oct 1, 2019 at 11:08 AM Graham Hunter <Graham.Hunter at arm.com> wrote:

> Hi Luke,

hi graham, thanks for responding in such an informative fashion.

> > On 1 Oct 2019, at 09:21, Luke Kenneth Casson Leighton via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> > typedef float vec4[4]; // SEW=32,LMUL=4 probably
> > static vec4 globalvec[1024]; // vscale == 1024 here
>
> 'vscale' just refers to the scaling factor that gives the maximum size of
> the vector at runtime, not the number of currently active elements.

ok, this starts to narrow down the definition.  i'm attempting to get
clarity on what it means.  so, in the example above involving
globalvec, "maximum size of the vector at runtime" would be "1024"
(not involving RVV VL).

and... would vscale be dynamically (but permanently) substituted
with the constant "1024", there?

and in that example i gave which was a local function, vscale would be
substituted with "local_vlen_param_len" permanently and irrevocably at
runtime?

or, is it intended to be dynamically (but permanently) substituted
with something related to RVV's *MVL* at runtime?

if it's intended to be substituted by MVL, *that* starts to make more
sense, because MVL may actually vary depending on the hardware on
which the program is being executed.  smaller systems may have an MVL
of only 1 (only allowing one element of a vector to be executed at any
one time) whereas Mainframe or massively-parallel systems may have...
MVL in the hundreds.
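
to make that reading concrete, here is a rough sketch in python.  it is
purely illustrative: the function names and the hardware values are made
up for the example, and it assumes the "vscale is fixed by the hardware,
MVL-style" interpretation is the intended one.  for a type such as
<vscale x 4 x i32> the compiled program stays the same and only the
multiplier differs per machine:

```python
# illustrative model only: none of these names are LLVM or RVV APIs.
ELEM_BITS = 32   # i32 elements
MIN_ELEMS = 4    # the "4" in <vscale x 4 x i32>


def max_elements(vscale, min_elems=MIN_ELEMS):
    """maximum number of elements one scalable vector holds on this hardware."""
    return vscale * min_elems


def vector_bits(vscale, min_elems=MIN_ELEMS, elem_bits=ELEM_BITS):
    """total width in bits of one scalable vector on this hardware."""
    return max_elements(vscale, min_elems) * elem_bits


if __name__ == "__main__":
    # hypothetical small, mid-size and large implementations
    for vscale in (1, 2, 16):
        print(vscale, max_elements(vscale), vector_bits(vscale))
```

the point is only that vscale bounds the *maximum* size; how many of
those elements are active in a given operation is a separate quantity.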

> SVE will be using predication alone to deal with data that doesn't fill an
> entire vector, whereas RVV and SX-Aurora

[and SV! :) ]

> want to use a separate mechanism
> that fits with their hardware having a changeable active length.

okaaay, now, yes, i Get It.  this is MVL (Max Vector Length) in RVV.
btw minor nitpick: it's not that "their" hardware changes, it's that
the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
multiple vendors each choosing an arbitrary MVL suited to their
customers' needs).  "RVV-compliant hardware" would fit things better.

hmmm that's going to be interesting for SV, because SV specifically
permits variable MVL *at runtime*.  however, just checking the spec
(don't laugh, yes i know i wrote it...) MVL is set through an
immediate.  there's a way to bypass that and set it dynamically, but
it's intended for context-switching, *not* for general-purpose use.

ah wait.... argh.  ok, is vscale expected to be a global constant *for
the entire application*?  note above: SV allows MVL to be set
*arbitrarily*, and this is extremely important.

the reason it is important is because unlike RVV, SV uses the actual
*scalar* register files.  it does *NOT* have a separate "Vector
Register File".

so if vscale was set to say 8 on a per-runtime basis, that then sets
the total number of registers *in the scalar register file* which will
be utilised for vectorisation.

it becomes impossible to set vscale to 4, which another function might
have been specifically designed to use.

so what would then need to be done is: predicate out the top 4
elements, which now comes with a performance-penalty and a whole
boat-load of mess.

so, apologies: we reaaaally need vscale to be selectable on, at the
very least, a per-function basis.

otherwise, applications would have to set it (at runtime) to the
"least inconvenient" value, wasting "the least-inconvenient number of
registers".
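
a small sketch of the cost being described (python, illustrative only;
the helper names are hypothetical, and "one scalar register per vector
element" is a simplification of how SV maps vectors onto the scalar
register file).  with a global vscale of 8 and a function that only
wants 4, the top 4 elements have to be predicated out:

```python
def tail_predicate(global_vscale, wanted_vscale):
    """predicate that keeps the low `wanted_vscale` elements and masks the rest."""
    if wanted_vscale > global_vscale:
        raise ValueError("cannot use more elements than the global vscale allows")
    return [i < wanted_vscale for i in range(global_vscale)]


def registers_wasted(global_vscale, wanted_vscale, live_vectors=1):
    """scalar registers reserved for vector elements that are always masked out."""
    mask = tail_predicate(global_vscale, wanted_vscale)
    return mask.count(False) * live_vectors


if __name__ == "__main__":
    print(tail_predicate(8, 4))
    print(registers_wasted(8, 4, live_vectors=3))
```

with a per-function vscale of 4 the predicate (and the reserved-but-idle
registers) would not be needed at all.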

> The scalable type tells you the maximum number of elements that could be
> operated on,

... which is related (in RVV) to MVL...

> and individual operations can constrain that to a smaller
> number of elements.

... by setting VL.

> > hmmmm.  so it looks like data-dependent fail-on-first is something
> > that's going to come up later, rather than right now.
>
> Arm's downstream compiler has been able to use the scalable type and a
> constant vscale with first-faulting loads for around 4 years, so there's
> no conflict here.

ARM's SVE uses predication. the LDs that would [normally] cause
page-faults create a mask, instead, giving *only* those LDs which
"succeeded".

that's then passed into standard [SIMD-based] predicated operations,
masking out operations, the most important one (for the canonical
strcpy / memcpy) being the ST.
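
as a toy model of that flow (python, illustrative only: running off the
end of a list stands in for a page-fault, the function names are made
up, and this is not the exact SVE semantics), the load produces a mask
of the lanes that succeeded and the store is predicated on that mask:

```python
def first_faulting_load(memory, base, width):
    """load `width` lanes; mask is True only for lanes before the first fault."""
    values, mask = [], []
    ok = True
    for i in range(width):
        addr = base + i
        ok = ok and addr < len(memory)  # out-of-range models a faulting lane
        mask.append(ok)
        values.append(memory[addr] if ok else 0)
    return values, mask


def predicated_store(dest, base, values, mask):
    """store only the lanes whose mask bit is set."""
    for i, (value, active) in enumerate(zip(values, mask)):
        if active:
            dest[base + i] = value


def copy_all(src, width):
    """memcpy-like loop built from the two operations above."""
    dest = [None] * len(src)
    base = 0
    while base < len(src):
        values, mask = first_faulting_load(src, base, width)
        predicated_store(dest, base, values, mask)
        base += sum(mask)  # advance by the number of lanes that succeeded
    return dest
```

note that the vector width never changes here: every iteration issues
`width` lanes and relies on the mask to suppress the ones that failed.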

> We will need to figure out exactly what form the first faulting intrinsics
> take of course, as I think SVE's predication-only approach doesn't quite
> fit with others -- maybe we'll end up with two intrinsics?

perhaps - as robin suggests, this is for another discussion (not related
to vscale).

or... maybe not.

if vscale was permitted to be dynamically set, not only would it suit
SV's ability to set different vscales on a per-function (or other)
basis, it could be utilised by RVV, SV, and anything else that changes
VL based on data-dependent conditions, to change the following
instructions.

what i'm saying is: vscale needs to be permitted to be a variable, not
a constant.

now, ARM SVE wouldn't *use* that capability: it would hard-code it to
512/SEW/etc.etc. (or whatever), by setting it to a global constant.
follow-up LLVM-IR-morphing passes would end up generating
globally-fixed-width SVE instructions.

RVV would be able to set that vscale variable as a way to indicate
data-dependent lengths [various IR-morphing-passes would carry out the
required substitutions prior to creating actual assembler].

SV would be able to likewise do that *and* start from a value for
vscale that suited each function's requirements to utilise a subset of
the register file which suited the workload.

SV could then trade off "register spill" with "vector size", which i
can tell you right now will be absolutely critical for 3D GPU
workloads.  we can *NOT* allow register spill using LD/STs for a GPU
workload covering gigabytes of data: the power consumption penalty
would just be mental [commercially totally unacceptable]. it would be
far better to allow a function which required that many registers to
dynamically set vscale=2 or possibly even vscale=1.

(we have 128 *scalar* registers, where, reminder: MVL is used to say
how many of the *SCALAR* register file get utilised to "make up" a
vector).
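
the trade-off can be sketched like this (python, illustrative only: the
128 comes from the paragraph above, everything else, including the
candidate vscale values and the "reserved scalars" parameter, is a
hypothetical simplification of a real register allocator).  each live
vector costs vscale scalar registers, so a function picks the largest
vscale that still fits without spilling:

```python
NUM_SCALAR_REGS = 128  # size of the SV scalar register file, per the text


def largest_vscale_without_spill(live_vectors, reserved_scalars=0,
                                 num_regs=NUM_SCALAR_REGS,
                                 candidates=(1, 2, 4, 8)):
    """largest candidate vscale whose register demand fits, or None if none do."""
    for vscale in sorted(candidates, reverse=True):
        if live_vectors * vscale + reserved_scalars <= num_regs:
            return vscale
    return None  # even vscale=1 would spill


if __name__ == "__main__":
    for live in (12, 40, 100, 120):
        print(live, largest_vscale_without_spill(live, reserved_scalars=16))
```

a function with few live vectors gets wide vectors; one with very many
drops to vscale=2 or vscale=1 rather than spilling to memory.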

oh.  ah.  bruce (et al), isn't there an option in RVV to allow Vectors
to sit on top of the *scalar* register file(s)? (Zfinx)
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-registers

> Or maybe we'll
> be able to synthesize a predicate from an active vlen and pattern match?

the purpose of having a dynamic VL, which comes originally from the
Cray Supercomputer Vector Architecture, is to not have to use the
kinds of instructions that perform bit-manipulation
(mask-manipulation).  those are not only wasted CPU cycles: in many
[simpler] hardware implementations they also end up with masked-out
"Lanes" running empty, particularly in ones that have Vector Front-ends
but predicated-SIMD-style ALU backends.

i would be quite concerned, therefore, if by "synthesise a predicate"
the idea was, instead of using actual dynamic truncation of vlen
(changing vscale), instructions were used to create a predicate which
had its last bits set to zero.

basically using RVV/SV fail-on-first to emulate the way that ARM SVE
fail-on-first creates masks.

that would be... yuk :)
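
to show the difference between the two approaches, a rough comparison
(python, illustrative only; the function names are made up and the
"idle lanes" count is a crude stand-in for what simpler hardware would
actually do).  the first loop shortens VL on the final iteration, the
second always issues full-width operations and builds a predicate with
its last bits cleared:

```python
def stripmine_with_vl(n, mvl):
    """per-iteration vector lengths when VL is set to min(mvl, remaining)."""
    vls = []
    remaining = n
    while remaining > 0:
        vl = min(mvl, remaining)
        vls.append(vl)
        remaining -= vl
    return vls


def stripmine_with_predicate(n, mvl):
    """per-iteration full-width predicates; the tail lanes are masked out."""
    return [[base + i < n for i in range(mvl)] for base in range(0, n, mvl)]


def idle_lanes(masks):
    """lanes that are issued but do no useful work."""
    return sum(mask.count(False) for mask in masks)


if __name__ == "__main__":
    print(stripmine_with_vl(10, 4))
    print(idle_lanes(stripmine_with_predicate(10, 4)))
```

both process the same 10 elements, but only the predicated version has
to compute a mask and carry lanes that run empty.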

> Sander's patch takes the existing 'vscale' keyword and allows it to be
> used outside the type, to serve as an integer constant that represents the
> same runtime value as it does in the type.

if i am understanding things correctly, it reaaally needs to be
allowed to be a variable, definitely not a constant.

> Some previous discussions proposed using an intrinsic to start with for this,
> and that may still happen depending on community reaction, but the Arm
> hpc compiler team felt it was important to at least start a wider discussion
> on this topic before proceeding. From our experience, using an intrinsic makes
> it harder to work with shufflevector or get good code generation. If someone
> can spot a problem with our reasoning on that please let us know.

honestly can't say, can i leave it to you to decide if it's related to
this vscale thread, and, if so, could you elaborate further?  if it's
not, feel free to leave it for another time?  will see if there is any
follow-up discussion here.

thanks graham.

l.