[llvm-dev] Adding support for vscale
Luke Kenneth Casson Leighton via llvm-dev
llvm-dev at lists.llvm.org
Tue Oct 1 07:55:50 PDT 2019
On Tue, Oct 1, 2019 at 2:07 PM Graham Hunter <Graham.Hunter at arm.com> wrote:
> > the RISC-V Vector Spec *allows* arbitrary MVL length (so there are
> > multiple vendors each choosing an arbitrary MVL suited to their
> > customer's needs). "RVV-compliant hardware" would fit things better.
> Yes, the hardware doesn't change, dynamic/active VL just stops processing
> elements past the number of active elements.
> SVE similarly allows vendors to choose a maximum hardware vector length
> but skips an active VL in favour of predication only.
> I'll try and clear things up with a concrete example for SVE.
> Allowable SVE hardware vector lengths are all multiples of 128 bits. So
> our main legal types for codegen will have a minimum size of 128 bits,
> e.g. <vscale x 4 x i32>.
> For Fujitsu's A64FX at 512 bits, vscale is 4 and legal type would now be
> equivalent to <16 x i32>.
okaaaay, so, right, it is kinda similar to MVL for RVV, except
dynamically settable in powers of 2. okaay. makes sense: just as
with Cray-style Vectors, high-end machines can go extremely wide.
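Graham's sizing example above can be sketched numerically. A hedged illustration (plain Python, not LLVM code; the names are mine) of how vscale relates a hardware vector width to the `<vscale x 4 x i32>` legal type:

```python
# Illustrative sketch: SVE hardware vector lengths are multiples of
# 128 bits, so vscale is the number of 128-bit granules.
SVE_GRANULE_BITS = 128

def vscale_for(hw_bits: int) -> int:
    # hypothetical helper: vscale for a given hardware width
    assert hw_bits % SVE_GRANULE_BITS == 0
    return hw_bits // SVE_GRANULE_BITS

def total_i32_lanes(hw_bits: int) -> int:
    # <vscale x 4 x i32> has 4 i32 lanes per 128-bit granule
    return vscale_for(hw_bits) * 4

# A64FX at 512 bits: vscale = 4, so <vscale x 4 x i32> is
# equivalent to <16 x i32>
```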
> In all cases, vscale is constant at runtime for those machines. While it
> is possible to change the maximum vector length from privileged code (so
> you could set the A64FX to run with 256b or 128b vectors if you chose...
> even 384b if you wanted to), we don't allow for changes at runtime since
> that may corrupt data. Expecting the compiler to be able to recover from
> a change in vector length when you have spilled registers to the stack
> isn't reasonable.
deep breath: i worked for Aspex Semiconductors, they have (had) what
was called an "Array String Processor". 2-bit ALUs could have a gate
opened up which constructed 4-bit ALUs; open up another gate and you now
have 8-bit ALUs; open another and you have 16-bit, 32-bit, 64-bit. thus,
as a massively-deep SIMD architecture, you could, at runtime, turn
computations round from either using 32 cycles with a batch of 32x
2-bit ALUs to perform 32x separate and distinct parallel 64-bit
operations, or open up all the gates, and use ONE cycle to compute a
single 64-bit operation.
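the trade-off described above can be sketched as a toy cost model (a hedged illustration only; the function names and the 32-slice figure are taken from the example, everything else is mine):

```python
# Toy model of the Aspex-style trade-off: N 2-bit ALU slices either
# work bit-serially (many cycles, many parallel results) or are
# ganged into one wide ALU (one cycle, one result).
TOTAL_2BIT_SLICES = 32

def cycles_bit_serial(op_bits: int) -> int:
    # each 2-bit slice chews through the operand 2 bits per cycle:
    # a 64-bit op takes 32 cycles, but all 32 slices run in parallel
    return op_bits // 2

def cycles_ganged(op_bits: int) -> int:
    # "open up all the gates": slices combine into one wide ALU
    assert op_bits <= TOTAL_2BIT_SLICES * 2
    return 1
```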
with LOAD/STORE taking fixed time but algorithms (obviously) taking
variable lengths of time, our job, as FAEs, was to write f*****g
SPREADSHEETS (yes, really) giving estimates of which was the best
possible balance to keep LD/STs an equal time-consumer as the frickin
algorithms. as you can probably imagine, this being in assembler, and
literally a dozen algorithms having to be written where one would
normally do, code productivity was measured in DAAAAYYYS per line of code.
we do have a genuine need to do something similar, here (except
automated or at an absolute minimum, under the guidance of #pragma).
the reason is because this is for a [hybrid] 3D GPU, to run
texturisation and other workloads. these are pretty much unlike a CPU
workload: data comes in, gets processed, data goes out. there's *one*
LD, one algorithm, one ST, in a massive loop covering tens to hundreds
of megabytes (to gigabytes, in large GPUs) per second.
if there's *any* register spill at all, the L1/L2 performance and
power penalty is so harsh that it's absolutely unthinkable to let it
happen. this was outlined in Jeff Bush's nyuzipass2016 paper.
the solution is therefore to have fine-grained dynamic control over
vscale, on a per-loop basis. letting the registers spill *cannot* be
permitted, so is not actually a problem per se.
with a fine-grained dynamic control over vscale, we can perform a
(much better, automated) equivalent of the awfulness-we-did-at-Aspex,
analysing the best vscale to use for that loop, that will cover as
many registers as possible, *without* spill. even if vscale gets set
to 1, that's far, _far_ better than allowing LD/ST register-spilling.
and with most 3D workloads being very specifically designed to fit
into 128 FP32 registers (even for MALI400 and Vivante GC800), and our
design having 128 FP64 registers that can be MMX-style subdivided into
2x FP32, 4x FP16, we should be fine.
> Robin found a way to make this work for RVV; there, he had the additional
> concern of registers being joined together in x2,x4,(x8?) combinations.
> This was resolved by just making the legal types bigger when that feature
> is in use iirc.
unfortunately - although i do not know the full details (Jacob knows
this better than I do) - there are some 3D workloads involving 4x3 or
3x4 matrices, and Texture datasets with arrays of X,Y,Z coordinates,
which mean that power-of-two boundaries will result in serious
performance penalties (a 25% reduction due to one Lane always running empty).
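the 25% figure above falls out of simple lane arithmetic; a hedged sketch (names are mine, the 3-in-4 packing is from the example):

```python
# Packing 3-element coordinates (X,Y,Z) into power-of-two (4-wide)
# lanes: one lane of every four carries no useful work.
def lane_utilisation(elems_per_item: int, lanes: int) -> float:
    return elems_per_item / lanes

# 3 of every 4 lanes do useful work -> 25% of the datapath runs empty
waste = 1.0 - lane_utilisation(3, 4)
```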
> Would that approach help SV, or is it just a backend thing deciding how
> many scalar registers it can spare?
it would be best to see what Jacob has to say: we're basically likely
to be reserving the top x32-x127 scalar registers for "use" as
vectors. however being able to dynamically alter the actual
allocation of registers on a per-loop basis [and never "spilling"] is
going to be critical to ensuring the commercial success and acceptance
of the entire processor.
in the absolute worst case we would be forced to set vscale = 1, which
then "punishes" performance by only utilising say x32-x47. this would
(hypothetically) result in a meagre 25% of peak performance (all 16
registers being effectively utilised as scalar-only).
if however vscale could be dynamically set to 4, that loop could
(hypothetically) deploy registers x32-x95, the parallelism would
properly kick in, and we'd get 4x the level of performance.
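the per-loop vscale selection described above can be sketched as follows. This is purely illustrative (the x32-x127 reservation and the 16-registers-per-lane footprint are the hypothetical figures from the text; the function is mine):

```python
# Pick the largest vscale whose register footprint fits in the
# reserved scalar-register file, so no spill can ever occur.
RESERVED_REGS = 128 - 32          # x32..x127 -> 96 registers

def best_vscale(regs_per_lane: int, max_vscale: int) -> int:
    # largest vscale that still avoids any spill; worst case is
    # vscale = 1 (scalar-only, the "punished" 25%-of-peak case)
    v = min(max_vscale, RESERVED_REGS // regs_per_lane)
    return max(v, 1)

# 16 regs per lane, vscale 4 -> 64 registers (x32-x95), no spill
```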
ok that was quite a lot, cutting much of what follows...
> > ARM's SVE uses predication. the LDs that would [normally] cause
> > page-faults create a mask, instead, giving *only* those LDs which
> > "succeeded".
> Those that succeeded until the first that didn't -- every bit in the
> mask after a fault is unset, even if it would have succeeded with a
> first-faulting gather operation.
yehyeh. i do like ffirst, a lot.
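the mask semantics Graham describes can be simulated in a few lines (a hedged sketch of the behaviour, not any real intrinsic):

```python
# First-faulting load mask: every bit from the first faulting lane
# onwards is cleared, even if a later lane would itself have
# succeeded with a first-faulting gather.
def ffirst_mask(lane_faults: list[bool]) -> list[bool]:
    mask, ok = [], True
    for fault in lane_faults:
        ok = ok and not fault
        mask.append(ok)
    return mask

# lanes 0-1 succeed, lane 2 faults: lanes 2 onwards are all masked off
```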
> > that's then passed into standard [SIMD-based] predicated operations,
> > masking out operations, the most important one (for the canonical
> > strcpy / memcpy) being the ST.
> Nod; I wrote an experimental early exit loop vectorizer which made use of that.
it's pretty awesome, isn't it? :) the one thing that nobody really
expected to be able to parallelise / auto-vectorise, and it's now
possible.
> Ah, I could have made it a bit clearer. I meant have a first-faulting
> load intrinsic which returns a vector and an integer representing the
> number of valid lanes.
[ah, when it comes up, (early-exit loopvec thread?) i should mention
that in SV we've augmented fail-first to *true* data-dependent
semantics, based on whether the result of [literally any] operation is
zero or not. work-in-progress here (related to FP "what constitutes
fail" because NaN can be considered "fail").]
> For architectures using a dynamic VL, you could
> then pass that integer to subsequent operations so they are tied to
> that number of active elements.
> For SVE/AVX512, we'd have to splat that integer and compare against
> a stepvector to generate a mask. Ugly, but it can be pattern matched
> into the direct first/no-faulting loads and masks for codegen.
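the splat-and-compare-against-stepvector trick Graham describes is easy to model (a hedged sketch; the names are mine):

```python
# Turn an active-element count into a per-lane mask: splat the
# integer across the lanes and compare against a stepvector
# (0, 1, 2, ...) -- lanes below the count are active.
def mask_from_count(active: int, vl: int) -> list[bool]:
    stepvector = list(range(vl))      # 0, 1, ..., vl-1
    splat = [active] * vl             # the splatted integer
    return [s < a for s, a in zip(stepvector, splat)]
```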
this sounds very similar to the RVV use of a special "vmfirst"
predicate-mask instruction which is used to detect the zero point in
the canonical strcpy example. it... works :)
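the vmfirst step in the canonical strcpy can be modelled like this (a hedged simulation of the idea, not the actual RVV instructions):

```python
# vmseq-style compare produces a mask of lanes equal to zero;
# vmfirst-style scan returns the index of the first set mask bit
# (-1 if none), locating the NUL terminator within the chunk.
def vmseq_zero(chunk: list[int]) -> list[bool]:
    return [b == 0 for b in chunk]

def vmfirst(mask: list[bool]) -> int:
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1

# a chunk of bytes with the NUL at index 5 ends the copy here
```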
> Or we just use separate intrinsics.
> To discuss later, I think; possibly on the early-exit loopvec thread.
ok, yes, agreed.