[llvm-dev] [RFC] Supporting ARM's SVE in LLVM

Sun Nov 27 17:43:29 PST 2016

>>Does this make sense? I am not after agreement just want to make sure we are on the same page regarding our aims before digging down into how VL actually looks and its interaction with the loop vectoriser’s chosen VF.
>
> As much sense as is possible, I guess.

I’ll take that.  Let's move on to the relationship between scalable vectors and VL.  VL is very much a hardware-centric value that we'd prefer not to expose at the IR level, beyond what is required for a sufficiently accurate cost model.

An initial attempt to represent scalable vectors might be <n x Ty>.  The problem with this approach is that there's no perfect interpretation of what the following type definitions mean:

	<n x i8>
	<n x i16>
	<n x i32>
	<n x i64>

[Interpretation 1]

A vector of "n" elements of the specified type.  Here "n" is likely to be scaled based on the largest possible element type.  This fits well with the following loop:

(1)	for (0..N) { bigger_type[i] += smaller_type[i]; }

but becomes inefficient when the largest element type is not required.

[Interpretation 2]

A vector full of the specified type.  Here "n" in isolation means nothing without an associated element type.  This fits well with the following loop:

(2)	for (0..N) { type[i] += type[i]; }

Neither interpretation is ideal, because both require implicit knowledge to understand the relationship between different vector types.  Our proposal is a vector type where that relationship is explicit, namely <n x M x Ty>.

Reconsidering the above loops with this type system leads to IR like:

(1)	<n x 4 x i32> += zext <n x 4 x i8> as <n x 4 x i32>    ; bigger_type=i32, smaller_type=i8
(2)	<n x 16 x i8> += <n x 16 x i8>

Here the value of "n" is the same across both loops and, more importantly, the bit-width of the largest vectors within both loops is the same.  The relevance of the second point is that we now have a property that can be varied based on a cost model.  This results in a predictable set of types that should lead to performant code, whilst allowing types outside that range to work as expected, just like non-scalable vectors.
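
To make the bit-width point concrete, here is a small Python sketch of the arithmetic for loops (1) and (2).  The helper name scalable_bits and the sample values chosen for "n" are assumptions made purely for illustration; nothing here is LLVM API.

```python
def scalable_bits(n, min_elts, elt_bits):
    """Total bit-width of <n x min_elts x iN> for a given runtime value of n.

    Hypothetical helper for illustration only; not part of LLVM.
    """
    return n * min_elts * elt_bits


# Sample values of "n" are assumptions; the real value is fixed by the hardware.
for n in (1, 2, 4, 16):
    wide = scalable_bits(n, 4, 32)    # <n x 4 x i32>, largest type in loop (1)
    narrow = scalable_bits(n, 16, 8)  # <n x 16 x i8>, largest type in loop (2)
    source = scalable_bits(n, 4, 8)   # <n x 4 x i8>, the zext operand in loop (1)

    # The largest vectors in both loops have the same width for every n.
    assert wide == narrow == 128 * n
    # The zext operand occupies a quarter of that width.
    assert source * 4 == wide
```

Whatever "n" turns out to be, the largest types in both loops stay the same size as each other, which is the property a cost model can then vary by choosing M.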

All that remains is the ability to reference the isolated value of the "n" in "<n x M x Ty>", which is where the "vscale" constant proposal comes in.

>>     %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
>>
>>for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")
>>
>I see what you mean.
>
>Quick question: Since you're saying "vscale" is an unknown constant,
>why not just:
>     %index.next = add nuw nsw i64 %index, i64 vscale

Hopefully the answer to this is now clear. Our intention is for a single constant to represent the runtime part of a scalable vector's length. Using the same loop examples from above, the induction variable updates become:

(1)	%index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
(2)	%index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 16)

The runtime part of the scalable vector lengths remains the same with the second loop processing 4x the number of elements per iteration.
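
As a rough illustration of the two induction variable updates, the following Python sketch steps an index by vscale * M and counts the iterations.  The function name, the trip count, and the vscale value are assumptions for the example, and the leftover count simply reports elements not covered by whole vectors; how a remainder is actually handled is not covered here.

```python
def vector_iterations(trip_count, vscale, min_elts):
    """Count whole-vector iterations when the index advances by vscale * min_elts.

    Hypothetical model of '%index.next = add %index, mul (vscale, min_elts)'.
    Returns (full iterations, elements left over after the last full step).
    """
    step = vscale * min_elts
    index = 0
    iterations = 0
    while index + step <= trip_count:
        index += step
        iterations += 1
    return iterations, trip_count - index


# Assumed values: 1024 elements and vscale == 2.
loop1 = vector_iterations(1024, 2, 4)   # loop (1): step is vscale * 4
loop2 = vector_iterations(1024, 2, 16)  # loop (2): step is vscale * 16

# Same vscale, but loop (2) covers 4x the elements per iteration,
# so it needs a quarter of the iterations.
assert loop1[0] == 4 * loop2[0]
```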

Does this make sense? Is this sufficient argument for the new type and associated "vscale" constant, or is there another topic that needs covering first?

As an aside, note that I am not describing a new style of vectorisation here.  SVE is perfectly capable of non-predicated vectorisation, with the loop-vectoriser ensuring no data-dependency violations using the same logic as for non-scalable vectors.  The exception is that, if a strict VF is required to maintain safety, we can simply fall back to non-scalable vectors that target Neon.  Obviously not ideal, but it gets the ball rolling.

	Paul!!!