[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

David A. Greene via llvm-dev llvm-dev at lists.llvm.org
Tue Jul 31 08:36:11 PDT 2018


Renato Golin <renato.golin at linaro.org> writes:

>> The points where VL would be changed are limited and I think would
>> require limited, straightforward additions on top of this proposal.
>
> Indeed. I have a limited view on the spec and even more so on hardware
> implementations, but it is my understanding that there is no attempt
> to change VL mid-loop.

What does "mid-loop" mean?  On traditional vector architectures it was
very common to change VL for the last loop iteration.  Otherwise you had
to have a remainder loop.  It was much better to change VL.
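To make that concrete, here is a scalar Python model of the classic strip-mined vector loop (the names and the VLMAX value are illustrative, not from any proposal): VL is set each trip to min(VLMAX, remaining), so the final partial iteration needs no separate remainder loop.

```python
VLMAX = 8  # hypothetical maximum hardware vector length

def strip_mined_sum(a):
    """Scalar model of a strip-mined vector reduction."""
    total = 0
    i = 0
    n = len(a)
    while i < n:
        vl = min(VLMAX, n - i)       # VL shrinks on the last trip
        total += sum(a[i:i + vl])    # one "vector" op over vl lanes
        i += vl
    return total
```

With a fixed VL, the trailing `n mod VLMAX` elements would instead need a scalar or predicated remainder loop.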

> If we can assume VL will be "the same" (not constant) throughout every
> self-contained sub-graph (from scalar|memory->vector to
> vector->scalar|memory), there we should encode it in the IR spec that
> this is a hard requirement.
>
> This seems consistent with your explanation of the Cray VL change as
> well as Bruce's description of RISC-V (both seem very similar to me),
> where VL can change between two loop iterations but not within the
> same iteration.

Ok, I think I am starting to grasp what you are saying.  If a value
flows from memory or some scalar computation to vector and then back to
memory or scalar, VL should only ever be set at the start of the vector
computation until it finishes and the value is deposited in memory or
otherwise extracted.  I think this is ok, but note that any vector
functions called may change VL for the duration of the call.  The change
would not be visible to the caller.

Just thinking this through, a case where one might want to change VL
mid-stream is something like a half-length set of operations that feeds
a vector concat and then a full length set of operations following.  But
again I think this would be a strange way to do things.  If someone
really wants to do this they can predicate away the upper bits of the
half-length operations and maintain the same VL throughout the
computation.  If predication isn't available they've got more
serious problems vectorizing code.  :)
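A scalar sketch of that alternative (merge-style predication is assumed here purely for illustration): a half-length operation is expressed at full VL by predicating away the upper lanes, so VL never changes.

```python
VL = 8  # full vector length, held constant throughout

def masked_add(a, b, pred):
    """Merge predication model: inactive lanes keep a's value."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]

# "Half-length" work at full VL: only the lower VL/2 lanes are active.
half_pred = [i < VL // 2 for i in range(VL)]
```

The upper lanes simply pass through untouched, which is exactly the effect a shorter VL would have had.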

> We will still have to be careful with access safety (alias, loop
> dependencies, etc), but that shouldn't be different than if VL was
> required to be constant throughout the program.

Yep.

>> That's right.  This proposal doesn't expose a way to change vscale, but
>> I don't think it precludes a later addition to do so.
>
> That was my point about this change being harder to do later than now.

I guess I don't see why it would be any harder later.

> I think no one wants to do that now, so we're all happy to pay the
> price later, because that will likely never come.

I am not so sure about that.  Power requirements may very well drive
more dynamic vector lengths.  Even today some AVX-512 implementations
falter if there are "too many" 512-bit operations.  Scaling back SIMD
width statically is very common today and doing so dynamically seems
like an obvious extension.  I don't know of any efforts to do this so
it's all speculative at this point.  But the industry has done it in the
past and we have a curious pattern of reinventing things we did before.

>> I don't see why predicate values would be affected at all.  If a machine
>> with variable vector length has predicates, then typically the resulting
>> operation would operate on the bitwise AND of the predicate and a
>> conceptual all 1's predicate of length VL.
>
> I think the problem is that SVE is fully predicated and Cray (RISC-V?)
> is not, so mixing the two could lead into weird predication
> situations.

Cray vector ISAs were fully predicated and also used a vector length.
It didn't cause us any serious issues.  In many ways having an
adjustable VL and predication makes things easier because you don't have
to regenerate predicates to switch to a shorter VL.
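The combined semantics described above can be modeled in a few lines (a scalar sketch, not any ISA's actual definition): a lane is active iff its predicate bit is set and its index is below VL, so shortening VL never requires touching the predicate.

```python
def active_lanes(pred, vl):
    """A lane executes iff pred is set AND the lane index is below VL.

    This is the bitwise AND of the explicit predicate with the
    implicit all-ones predicate of length VL."""
    return [p and (i < vl) for i, p in enumerate(pred)]
```

For example, dropping VL from 8 to 5 deactivates lanes 5-7 regardless of their predicate bits, with no predicate regeneration.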

> So, if a high-level optimisation pass assumes full predication and
> changes the loop accordingly, and another pass assumes no predication
> and adds VL changes (say, loop tails), then we may end up with
> incompatible IR that will be hard to select down in ISel.
>
> Given that SVE has both predication and vscale change, this could
> happen in practice. It wouldn't be necessarily wrong, but it would
> have to be a conscious decision.

It seems strange to me for an optimizer to operate in such a way.  The
optimizer should be fully aware of the target's capabilities and use
them accordingly.  But let's say this happens.  Pass 1 vectorizes the
loop with predication (for a conditional loop body) and creates a
remainder loop, which would also need to be predicated.  Note that such
a remainder loop is not necessary with full predication support but for
the sake of argument let's say pass 1 is not too smart.

Pass 2 comes along and says, "hey, I have the ability to change VL so we
don't need a remainder loop."  It rewrites the main loop to use dynamic
VL and removes the remainder loop.  During that rewrite, pass 2 would
have to maintain predication.  It can use the very same predicate values
pass 1 generated.  There is no need to adjust them because the VL is
applied "on top of" the predicates.
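The two loop shapes can be modeled side by side (a scalar Python sketch with an illustrative VLMAX; the `> 0` condition stands in for an arbitrary loop-body predicate) to show that the rewrite preserves the predicates unchanged.

```python
VLMAX = 4  # hypothetical fixed vector length

def vectorized_fixed_vl(a, b):
    """Pass 1 shape: fixed-VL main loop plus a predicated remainder."""
    out = list(a)
    n = len(a)
    main = (n // VLMAX) * VLMAX
    for i in range(0, main, VLMAX):
        for lane in range(VLMAX):
            if out[i + lane] > 0:          # loop-body predicate
                out[i + lane] += b[i + lane]
    for i in range(main, n):               # remainder, same predicate
        if out[i] > 0:
            out[i] += b[i]
    return out

def vectorized_dynamic_vl(a, b):
    """Pass 2 shape: VL = min(VLMAX, remaining); no remainder loop.

    The body predicate is untouched -- VL is applied on top of it."""
    out = list(a)
    i, n = 0, len(a)
    while i < n:
        vl = min(VLMAX, n - i)
        for lane in range(vl):
            if out[i + lane] > 0:          # same predicate as pass 1
                out[i + lane] += b[i + lane]
        i += vl
    return out
```

Both produce identical results; the only difference is that the dynamic-VL version folds the remainder into the main loop.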

Pass 2 effectively rewrites the code to what the vectorizer should have
emitted in the first place.  I'm not seeing how ISel is any more
difficult.  SVE has an implicit vscale operand on every instruction and
ARM seems to have no difficulty selecting instructions for it.  Changing
the value of vscale shouldn't impact ISel at all.  The same instructions
are selected.

>> Changing vscale would be no different than changing any other value in
>> the program.  The dataflow determines its possible values at various
>> program points.  vscale is an extra (implicit) operand to all vector
>> operations with scalable type.
>
> It is, but IIGIR, changing vscale and predicating are similar
> transformations to achieve the similar goals, but will not be
> represented the same way in IR.

They probably will not be represented the same way, though I think they
could be (but probably shouldn't be).

> Also, they're not always interchangeable, so that complicates the IR
> matching in ISel as well as potential matching in optimisation passes.

I'm not sure it does but I haven't worked something all the way through.

>> Why?  If a user does asm or some other such trick to change what vscale
>> means, that's on the user.  If a machine has a VL that changes
>> iteration-to-iteration, typically the compiler would be responsible for
>> controlling it.
>
> Not asm, sorry. Inline asm is "user error".

Ok.

> I meant: make sure adding an IR visible change in VL (say, an
> intrinsic or instruction), within a self-contained block, becomes an
> IR error.

What do you mean by "self-contained block?"  Assuming I understood it
correctly, the restriction you described at the top seems reasonable for
now.

>> If the vendor provides some target intrinsics to let the user write
>> low-level vector code that changes vscale in a high-level language, then
>> the vendor would be responsible for adding the necessary bits to the
>> frontend and LLVM.  I would not recommend a vendor try to do this.  :)
>
> Not recommending by making it an explicit error. :)
>
> It may sound harsh, but given we're taking some pretty liberal design
> choices right now, which could have long lasting impact on the
> stability and quality of LLVM's code generation, I'd say we need to be
> as conservative as possible.

Ok, but would the optimizer be prevented from introducing VL changes?

>> I don't see why.  Anyone adding ability to change vscale would need to
>> add intrinsics and specify their semantics.  That shouldn't change
>> anything about this proposal and any such additions shouldn't be
>> hampered by this proposal.
>
> I don't think it would be hard to do, but it could have consequences
> to the rest of the optimisation and code generation pipeline.

It could.  I don't think any of us has a clear idea of what those might
be.

> I do not claim to have a clear vision on any of this, but as I said
> above, it will pay off long term if we start conservative.

Being conservative is fine, but we should have a clear understanding of
exactly what that means.  I would not want to prohibit all VL changes
now and forever, because I see that as unnecessarily restrictive and
possibly damaging to supporting future architectures.

If we don't want to provide intrinsics for changing VL right now, I'm
all in favor.  There would be no reason to add error checks because
there would be no way within the IR to change VL.

But I don't want to preclude adding such intrinsics in the future.

>> I don't think we should worry about taking IR with dynamic changes to VL
>> and trying to generate good code for any random target from it.  Such IR
>> is very clearly tied to a specific kind of target and we shouldn't
>> bother pretending otherwise.
>
> We're preaching for the same goals. :)

Good!  :)

> But we're trying to represent slightly different techniques
> (predication, vscale change) which need to be tied down to only
> exactly what they do.

Wouldn't intrinsics to change vscale do exactly that?

> Being conservative and explicit on the semantics is, IMHO, the easiest
> path to get it right. We can surely expand later.

I'm all for being explicit.  I think we're basically on the same page,
though there are a few things noted above where I need a little more
clarity.

                               -David

