[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Robin Kruppe via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 1 12:59:51 PDT 2018


On 1 August 2018 at 18:26, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 08/01/2018 06:15 AM, Renato Golin wrote:
>> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> In some sense, if you make vscale dynamic,
>>> you've introduced dependent types into LLVM's type system, but you've
>>> done it in an implicit manner. It's not clear to me that works. If we
>>> need dependent types, then an explicit dependence seems better. (e.g.,
>>> <scalable <n> x %vscale_var x <type>>)
>> That's a shift from the current proposal and I think we can think
>> about it after the current changes. For now, both SVE and RISC-V are
>> proposing function boundaries for changes in vscale.
>
> I understand. I'm afraid that the function-boundary idea doesn't work
> reasonably.

FWIW, I don't think dependent types really help with the code motion
problems. While using an SSA value in a type would presumably enforce
that instructions mentioning that type have to be dominated by the
definition of said value, the real problem is when you _stop_ using
one vscale (and presumably start using another). For example, we want
to rule out the following:

  %vscale.1 = call i32 @change_vscale(...)
  %v.1 = load <scalable 4 x %vscale.1 x i32> ...
  %vscale.2 = call i32 @change_vscale(...)
  %v.2 = load <scalable 4 x %vscale.1 x i32> ... ; vscale changed, but
                                                 ; we're still using the
                                                 ; old one

And of course, actually introducing this notion of types mentioning
SSA values into LLVM would be a huge and difficult step. I did
consider something along these lines (and even had a digression about
it in drafts of my RFC, which I cut from the final version), but I
don't think it's viable.

Tying some values to the function they're in, on the other hand, even
has precedent in current LLVM: token values must be confined to one
function (intrinsics are special, of course), so many interprocedural
passes must already be careful about moving certain kinds of values
between functions. It's ad-hoc and requires auditing passes, yes, but
it's something we know and have some experience with.
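
For reference, a minimal example of how token values already behave
in today's IR, using the EH pad instructions (which produce tokens):

  declare void @may_throw()
  declare i32 @__gxx_personality_v0(...)

  define void @f() personality i32 (...)* @__gxx_personality_v0 {
  entry:
    invoke void @may_throw()
            to label %cont unwind label %cleanup

  cleanup:
    ; %tok has token type; the verifier requires all its uses to stay
    ; inside this function, so passes must not move them elsewhere.
    %tok = cleanuppad within none []
    cleanupret from %tok unwind to caller

  cont:
    ret void
  }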

(The similarity to tokens is strong enough that my original proposal
heavily leaned on tokens to encode the restrictions on the optimizer
that are needed for different-vscale-per-function, but I've been
persuaded that it's more trouble than it's worth, hence the "implicit"
approach of this RFC.)

>>
>>
>>> 2. How would the function-call boundary work? Does the function itself
>>> have intrinsics that change the vscale?
>> Functions may not know what their vscale is until they're actually
>> executed. They could even have different vscales for different call
>> sites.
>>
>> AFAIK, it's not up to the compiled program (ie via a function
>> attribute or an inline asm call) to change the vscale, but the
>> kernel/hardware can impose dynamic restrictions on the process. But,
>> for now, only at (binary object) function boundaries.
>
> I'm not sure if that's better or worse than the compiler putting in code
> to indicate that the vscale might change. How do vector function
> arguments work if vscale gets larger? or smaller?

I don't see any way for the OS to change a running process's vscale
without a great amount of cooperation from the program and the
compiler. In general, the kernel has nowhere near enough information
to identify spots where it's safe to fiddle with vscale -- function
call boundaries aren't safe in general, as you point out. FWIW, in
the RISC-V vector task group we discussed migrating running processes
between cores in heterogeneous architectures (think big.LITTLE) that
may have different vector register sizes. We quickly agreed that
there's no way to make that work and dismissed the idea. The current
thinking is that if you want to migrate a process that's currently
using the vector unit, you can only migrate it between cores that
have the same kind of vector register file.

For the RISC-V backend I don't want anything to do with OS
shenanigans; I'm exclusively focused on codegen. The backend inserts
machine code in the prologue that configures the vector unit in
whatever way the backend considers best, and this configuration
determines vscale (and some other things that aren't visible to IR).
The caller saves their vector unit state before the call and restores
it after the call returns, so their vscale is not affected by the call
either.
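
In IR terms, that means the caller can rely on vscale being unchanged
across a call. A minimal sketch, using the <scalable n x ty> syntax
proposed in this RFC:

  declare void @g()

  ; @g configures its own vector unit in its prologue, but the
  ; caller's state (and thus vscale) is saved and restored around the
  ; call, so %a is still a well-defined value of the caller's vector
  ; type after the call returns.
  define <scalable 4 x i32> @f(<scalable 4 x i32> %a) {
    call void @g()
    ret <scalable 4 x i32> %a
  }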

For SVE, I could imagine a function attribute that indicates it's OK
to change vscale at this point (this would probably have to be a very
careful and deliberate decision by a programmer). The backend could
then change vscale in the prologue, either setting it to a specific
value (e.g., one requested by the attribute) or making a libcall that
asks the kernel to adjust vscale if it wants to.

In both cases, the change happens after the caller saved all their
state and before any of the callee's code runs.
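
Purely as an illustration -- the attribute name below is made up, not
part of any proposal -- such a function might be spelled:

  ; Hypothetical attribute marking a function whose prologue may
  ; change vscale (setting it to a fixed value, or asking the kernel
  ; via a libcall). Note that it takes no scalable vector arguments;
  ; see below for why that matters.
  define void @compute_kernel(i32* %data, i64 %n) "allow-vscale-change" {
  entry:
    ; (body elided)
    ret void
  }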

That leaves arguments and return values, and more generally any vector
values that are shared (e.g., in memory) between caller and callee.
Indeed, it's not possible to share any vectors between two functions
that disagree on how large a vector is (it sounds obvious when you put
it that way). If you need to pass vectors in any way, caller and
callee have to agree on vscale as part of the ABI, and the callee does
*not* change vscale but "inherits" it from the caller. On SVE that's
the default ABI; on RISC-V there will be one or more non-default
"vector call" ABIs (as Bruce mentioned in an earlier email).

In IR we could represent these different ABIs through calling
convention numbers, function attributes, or a combination thereof.
With ABIs where caller and callee don't necessarily agree on vscale,
it is simply impossible to pass vector values (and while you can,
e.g., pass the caller's vscale value, it probably isn't meaningful to
the callee):

- it's a verifier error if such a function takes or returns scalable
vectors directly
- a source program that e.g. tries to smuggle a vector from one
function to another through heap memory is erroneous (sketched below)
- the optimizer must not introduce such errors in correct input programs
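
To make the second point concrete, here is a sketch of the erroneous
heap-smuggling case (again using the proposed syntax):

  ; Erroneous if @produce and @consume execute with different vscales:
  ; the store and the load would disagree on how many bytes the buffer
  ; at %p covers.
  define void @produce(<scalable 4 x i32>* %p) {
    store <scalable 4 x i32> zeroinitializer, <scalable 4 x i32>* %p
    ret void
  }

  define <scalable 4 x i32> @consume(<scalable 4 x i32>* %p) {
    %v = load <scalable 4 x i32>, <scalable 4 x i32>* %p
    ret <scalable 4 x i32> %v
  }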

The last point means, for example, that partial inlining can't pull
the computation of a vector value into the caller and pass the result
as a new argument. Such optimizations wouldn't be correct anyway,
regardless of ABI concerns: the instructions that are affected all
depend on vscale and therefore moving them to a different function
changes their behavior. Of course, this doesn't mean all
interprocedural optimizations are invalid. *Complete* inlining, for
example, is always valid.

To be clear, all of this applies only if caller and callee don't
agree on vscale. With suitable ABIs, all existing optimizations can
be applied without problems.

> So, if I have some vectorized code, and we figure out that some of it is
> cold, so we outline it, and then the kernel decides to decrease vscale
> for that function, now I have broken the application? Storing a vector
> argument in memory in that function now doesn't store as much data as it
> would have in the caller?
>
>>
>> I don't know how that works at the kernel level (how to detect those
>> boundaries? instrument every branch?) but this is what I understood
>> from the current discussion.
>
> Can we find out?
>
>>
>>
>>> If so, then it's not clear that
>>> the function-call boundary makes sense unless you prevent inlining. If
>>> you prevent inlining, when does that decision get made? Will the
>>> vectorizer need to outline loops? If so, outlining can have a real cost
>>> that's difficult to model. How do return types work?
>> The dynamic nature is not part of the program, so inlining can happen
>> as always. Given that the vectors are agnostic of size and work
>> regardless of what the kernel provides (within safety boundaries), the
>> code generation shouldn't change too much.
>>
>> We may have to create artefacts to restrict the maximum vscale (for
>> safety), but others are better equipped to answer that question.
>>
>>
>>>  1. I can definitely see the use cases for changing vscale dynamically,
>>> and so I do suspect that we'll want that support.
>> At a process/function level, yes. Within the same self-contained
>> sub-graph, I don't know.
>>
>>
>>>  2. LLVM does not have loops as first-class constructs. We only have SSA
>>> (and, thus, dominance), and when specifying restrictions on placement of
>>> things in function bodies, we need to do so in terms of these constructs
>>> that we have (which don't include loops).
>> That's why I was trying to define the "self-contained sub-graph" above
>> (there must be a better term for that). It has to do with data
>> dependencies (scalar|memory -> vector -> scalar|memory), ie. make sure
>> side-effects don't leak out.
>>
>> A loop iteration is usually such a block, but not all are and not all
>> such blocks are loops.
>>
>> Changing vscale inside a function, but outside of those blocks would
>> be "fine", as long as we made sure code movement respects those
>> boundaries and that context would be restored correctly on exceptions.
>> But that's not part of the current proposal.
>
> But I don't know how to implement that restriction without major changes
> to the code base. Such a restriction doesn't follow from use/def chains,
> and if we need a restriction that involves looking for non-SSA
> dependencies (e.g., memory dependencies), then I think that we need
> something different than the current proposal. Explicitly dependent
> types might work, something like intrinsics might work, etc.

Seconded, this is an extraordinarily difficult problem. I've spent
unreasonable amounts of time thinking about ways to model changing
vector sizes and sketching countless designs for it. Multiple times I
convinced myself some clever setup would work, and every time I later
discovered a fatal flaw. Until I settled on "only at function
boundaries", that is, and even that took a few iterations.


Cheers,
Robin

> Thanks again,
> Hal
>
>>
>> Changing vscale inside one of those blocks would be madness. :)
>>
>> cheers,
>> --renato
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

