[llvm-dev] Adding support for vscale

Tue Oct 1 01:21:24 PDT 2019

On Tue, Oct 1, 2019 at 8:08 AM Robin Kruppe <robin.kruppe at gmail.com> wrote:
>
> Hello Jacob and Luke,
>
> First off, even if a dynamically changing vscale was truly necessary
> for RVV or SV, this thread would be far too late to raise the question.
> That vscale is constant -- that the number of elements in a scalable
> vector does not change during program execution -- is baked into the
> accepted scalable vector type proposal from top to bottom and in fact
> was one of the conditions for its acceptance...

that should be explicitly made clear in the patches.  it sounds very
much like it's only suitable for statically-allocated
arrays-of-vectorisable-types:

typedef float vec4[4]; // SEW=32,LMUL=4 probably
static vec4 globalvec[1024]; // vscale == 1024 here

or, would it be intended for use inside functions - again statically-allocated?

int somefn(void) {
  static vec4 localvec[1024]; // vscale == 1024 here
}

*or*, would it be intended to be used like this?

int somefn(int num_of_vec4s) {
  vec4 localvec[num_of_vec4s]; // vscale == dynamic, here
}

clarifying this in the documentation strings on vscale, perhaps even
providing c-style examples, would be extremely useful, and avoid
misunderstandings.

> ... (runtime-variable type
> sizes create many more headaches which nobody has worked out
> how to solve to a satisfactory degree in the context of LLVM).

hmmmm.  so it looks like data-dependent fail-on-first is something
that's going to come up later, rather than right now.

> *This* thread is just about whether vscale should be exposed to programs
> in the form of a Constant or as an intrinsic which always returns the same
> value during one program execution.
>
> Luckily, this is not a problem for RVV. I do not know anything about this
> "SV" extension you are working on

SV has been designed specifically to help with the creation of
*Hybrid* CPU / VPU / GPUs.  it's very similar to RVV except that there
are no new instructions added.

a typical GPU would be happy to have 128-bit-wide SIMD or VLIW-style
instructions, on the basis that (A) the shader programs are usually no
greater than 1K in size and (B) those 128-bit-wide instructions have
an extremely high bang-per-buck ratio, of 32x FP32 operations issued
at once.

in a *hybrid* CPU / VPU / GPU context, even a 1k shader program takes
up a significant portion of the 1st level cache, which is *not*
separate from a *GPU*'s 1st level cache, because the CPU *is* the GPU.

consequently, SV has been specifically designed to "compactify"
instruction effectiveness by "prefixing" even RVC 16-bit opcodes with
vectorisation "tags".

this has the side-effect of reducing executable size by over 10% in
many cases when compared to RVV.

> so I cannot comment on that, but I'll sketch the reasons for why it's not
> an issue with RVV and maybe that helps you with SV too.

looks like it does: Jacob explains (in another reply) that MVL is
exactly the same concept, except that in RVV it is hard-coded (baked)
into the hardware, where in SV it is explicitly set as a CSR, and i
explained in the previous reply that in RVV the VL CSR is requested
(and the hardware chooses a value), whereas in SV, the VL CSR *must*
be set to exactly what is requested [within the bounds of MVL, sorry,
left that out earlier].

> As mentioned above, this is tangential to the focus of this thread, so if
> you want to discuss further I'd prefer you do that in a new thread.

it's not yet clear whether vscale is intended for use in
static-allocation involving fixed constants or whether it's intended
for use with runtime-dependent variables inside functions.

with that not being clear, my questions are not tangential to the
focus of the thread.

however yes i would agree that data-dependent fail-on-first is
definitely not the focus of this thread, and would need to be
discussed later.

we are a very small team at the moment, so we may end up missing
valuable discussions: how can it be ensured that we are included in
future discussions?

> [...]
> You may be aware of Simon Moll's vector predication (previously:
> explicit vector length) proposal which does just that.

ah yehyehyeh.  i remember.

> In contrast, the vscale concept is more about how many elements a
> vector register contains, regardless of whether some operations process
> only a subset of them.

ok so this *might* be answering my question about vscale being
relatable to a function parameter (the last of the c examples).  it
would be good to clarify.
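
to check my own understanding of the distinction being drawn, here is a
rough c sketch (not taken from the patch).  ELEMS_PER_REG and add_arrays
are hypothetical names: ELEMS_PER_REG stands in for "how many elements a
vector register contains" and never changes during the run, whereas
"active" is the per-iteration subset that actually gets processed, which
is the VL-like quantity.  the value 8 is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical: elements per vector register, fixed for the whole run */
enum { ELEMS_PER_REG = 8 };

/* add two arrays in register-sized chunks; returns the number of chunks */
size_t add_arrays(const int *a, const int *b, int *out, size_t n)
{
    size_t chunks = 0;
    for (size_t i = 0; i < n; ) {
        size_t remaining = n - i;
        /* active element count for this iteration (the VL-like quantity) */
        size_t active = remaining < ELEMS_PER_REG ? remaining : ELEMS_PER_REG;
        /* the register capacity is constant; only the active subset varies */
        assert(active >= 1 && active <= ELEMS_PER_REG);
        for (size_t j = 0; j < active; j++)
            out[i + j] = a[i + j] + b[i + j];
        i += active;
        chunks++;
    }
    return chunks;
}
```

with n == 20 this runs three chunks with active == 8, 8 and 4, while
ELEMS_PER_REG stays at 8 throughout.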

> In RVV terms that means it's related not to VL but more to VBITS,
> which is indeed a constant (and has been for many months).

ok so VL is definitely "assembly-level" rather than something that
actually is exposed to the intrinsics.  that may turn out to be a
mistake when it comes to data-dependent fail-on-first capability
(which is present in a *DIFFERENT* form in ARM SVE, btw), but would,
yes, need discussion separately.

> For example <vscale x 4 x i16> has four times as many elements and
> twice as many bits as <vscale x 1 x i32>, so it captures the distinction
> between a SEW=16,LMUL=2 vtype setting and a SEW=32,LMUL=1
> vtype setting.
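
for other readers, the arithmetic in that quoted paragraph can be
sketched in c as follows.  this is only a model: struct scalable_vec_type,
num_elems and total_bits are hypothetical names made up for illustration,
and vscale is passed in as a plain positive integer, which is an
assumption on my part.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical model of an IR type <vscale x min_elems x iN> */
struct scalable_vec_type {
    size_t min_elems; /* the "4" in <vscale x 4 x i16> */
    size_t elem_bits; /* the "16" in <vscale x 4 x i16> */
};

/* element count at run time: vscale multiplied by the minimum count */
size_t num_elems(struct scalable_vec_type t, size_t vscale)
{
    assert(vscale > 0); /* assumption: vscale is a positive integer */
    return vscale * t.min_elems;
}

/* total width in bits of one such vector */
size_t total_bits(struct scalable_vec_type t, size_t vscale)
{
    return num_elems(t, vscale) * t.elem_bits;
}
```

for example with vscale == 2, <vscale x 4 x i16> comes out at 8 elements
and 128 bits, and <vscale x 1 x i32> at 2 elements and 64 bits: four
times the elements and twice the bits, whatever vscale happens to be.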

hang on - so this may seem like a silly question: is it intended that
the *word* vscale would actually appear in LLVM-IR i.e. it is a new
compiler "keyword"?  or did you use it here in the context of just "an
example", where actually the idea is that the actual value would be
<5 x 4 x i16> or <5 x 1 x i32>?

let me re-read the summary:

"This patch adds vscale as a symbolic constant to the IR, similar to
undef and zeroinitializer, so that it can be used in constant
expressions."

it's a keyword, isn't it?

so, that "vscale" keyword would be substituted at runtime by either a
constant (1024) *or* a runtime-calculated variable or function
parameter (num_of_vec4s), is that correct?

apologies for asking: these are precisely the kinds of
from-zero-prior-knowledge questions that help with any review process
to clarify things for other users/devs.

l.