[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tue Jun 5 11:25:00 PDT 2018

Hi David,

Thanks for taking a look.

> On 5 Jun 2018, at 16:23, dag at cray.com wrote:
> 
> Hi Graham,
> 
> Just a few initial comments.
> 
> Graham Hunter <Graham.Hunter at arm.com> writes:
> 
>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>>  bytes.
> 
> "scalable" instead of "scalable x."

Yep, missed that in the conversion from the old <n x m x ty> format.

> 
>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>> integers (one to indicate unscaled bits, the other for bits that need to be
>> scaled by the runtime multiple) will be added. For backends that do not need to
>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>> only returns unscaled bits will be provided, with a debug assert that the type
>> isn't scalable.
> 
> Can you explain a bit about what the two integers represent?  What's the
> "unscaled" part for?

'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
multiplied by vscale'.

> 
> The name "getSizeExpressionInBits" makes me think that a Value
> expression will be returned (something like a ConstantExpr that uses
> vscale).  I would be surprised to get a pair of integers back.  Do
> clients actually need constant integer values or would a ConstantExpr
> suffice?  We could add a ConstantVScale or something to make it work.

I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
integers representing the known-at-compile-time terms in an expression:
'(scaled_bits * vscale) + unscaled_bits'.

Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
compile time like <4 x i32> the size would be (128, 0).

For a scalable type like <scalable 4 x i32> the size would be (0, 128).

For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).

When calculating the offset for memory addresses, you just need to multiply the scaled
part by vscale and add the unscaled part as is.
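
A minimal sketch of that arithmetic in C++. The SizeExpr struct and the helper
names sizeInBits and fixedSizeInBits are hypothetical stand-ins for illustration,
not the proposed LLVM API; the field order follows the (unscaled, scaled)
assumption above, and padding and alignment are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the (unscaled, scaled) pair discussed above.
struct SizeExpr {
  uint64_t unscaled; // exactly this many bits
  uint64_t scaled;   // this many bits, multiplied by vscale at runtime
};

// Total size for a given runtime multiple: (scaled * vscale) + unscaled.
constexpr uint64_t sizeInBits(SizeExpr s, uint64_t vscale) {
  return s.scaled * vscale + s.unscaled;
}

// Fixed-size query: only meaningful when no scaled bits are present,
// mirroring the debug assert described in the RFC text.
inline uint64_t fixedSizeInBits(SizeExpr s) {
  assert(s.scaled == 0 && "type is scalable");
  return s.unscaled;
}

constexpr SizeExpr kFixedVec{128, 0};     // <4 x i32>
constexpr SizeExpr kScalableVec{0, 128};  // <scalable 4 x i32>
constexpr SizeExpr kMixedStruct{64, 256}; // { <scalable 32 x i8>, i64 }
```

With vscale = 2, the three example sizes evaluate to 128, 256 and 576 bits
respectively.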

> 
>> Comparing two of these sizes together is straightforward if only unscaled sizes
>> are used. Comparisons between scaled sizes are also simple when comparing sizes
>> within a function (or across functions with the inherit flag mentioned in the
>> changes to the type), but such sizes cannot be compared otherwise. If a mix is present,
>> then any number of unscaled bits will not be considered to have a greater size
>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>> will be considered to have a smaller size than a greater number of scaled bits
>> (since the runtime multiple is at least one).
> 
> If we went the ConstantExpr route and added ConstantExpr support to
> ScalarEvolution, then SCEVs could be compared to do this size
> comparison.  We have code here that adds ConstantExpr support to
> ScalarEvolution.  We just didn't know if anyone else would be interested
> in it since we added it solely for our Fortran frontend.

We added a dedicated SCEV expression class for vscale instead; I suspect it works
either way.
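
Independently of how vscale is represented, the mixed comparison rule quoted
above comes down to asking whether one size is smaller for every vscale >= 1.
A rough sketch follows. SizeExpr and isKnownLess are hypothetical names, and
extending the rule to sizes that have both an unscaled and a scaled part is an
assumption made here rather than something stated in the RFC. It also assumes
both sizes share the same vscale and that vscale has no known upper bound.

```cpp
#include <cstdint>

// Hypothetical (unscaled, scaled) pair; size(v) = unscaled + scaled * v.
struct SizeExpr {
  uint64_t unscaled;
  uint64_t scaled;
};

// True only if a is smaller than b for every vscale >= 1.
// Both sizes are linear in vscale, so this holds exactly when b grows at
// least as fast as a (b.scaled >= a.scaled) and a is already smaller than
// b at vscale == 1.
constexpr bool isKnownLess(SizeExpr a, SizeExpr b) {
  return b.scaled >= a.scaled &&
         b.unscaled + b.scaled > a.unscaled + a.scaled;
}
```

For example, 64 unscaled bits are known to be smaller than 128 scaled bits, but
256 unscaled bits are not known to be larger than 128 scaled bits, so
isKnownLess returns false in both directions for that second pair.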

> 
>> We have added an experimental `vscale` intrinsic to represent the runtime
>> multiple. Multiplying the result of this intrinsic by the minimum number of
>> elements in a vector gives the total number of elements in a scalable vector.
> 
> I think this may be a case where adding a full-fledged Instruction might
> be worthwhile.  Because vscale is intimately tied to addressing, it
> seems like things such as ScalarEvolution support will be important.  I
> don't know what's involved in making intrinsics work with
> ScalarEvolution but it seems strangely odd that a key component of IR
> computation would live outside the IR proper, in the sense that all
> other fundamental addressing operations are Instructions.

We've tried it as both an instruction and as a 'Constant', and both work fine with
ScalarEvolution. I have not yet tried it with the intrinsic.

> 
>> For constants consisting of a sequence of values, an experimental `stepvector`
>> intrinsic has been added to represent a simple constant of the form
>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>> start can be added, and changing the step requires multiplying by a splat.
> 
> This is another case where an Instruction might be better, for the same
> reasons as with vscale.
> 
> Also, "iota" is the name Cray has traditionally used for this operation
> as it is the mathematical name for the concept.  It's also used by C++
> and Go and so should be familiar to many people.

Iota would be fine with me; I forget the reason we didn't go with that initially. We
also had 'series_vector' in the past, but that was a more generic form that took start
and step parameters directly, rather than requiring additional IR instructions to
multiply and add as we do for stepvector.
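
To illustrate the relationship between the two forms, here is a small C++
sketch (C++14 or later). Vec, stepVector, splat and seriesVector are
hypothetical names for illustration only; N stands in for the total lane count
(minimum element count times vscale) at one particular vscale, which real
scalable vectors would not know at compile time.

```cpp
#include <cstddef>
#include <cstdint>

// Fixed-length stand-in for a vector with N lanes.
template <std::size_t N>
struct Vec {
  int64_t lanes[N];
};

// <0, 1, 2, ... N-1>
template <std::size_t N>
constexpr Vec<N> stepVector() {
  Vec<N> v{};
  for (std::size_t i = 0; i < N; ++i)
    v.lanes[i] = static_cast<int64_t>(i);
  return v;
}

// Every lane holds the same value.
template <std::size_t N>
constexpr Vec<N> splat(int64_t x) {
  Vec<N> v{};
  for (std::size_t i = 0; i < N; ++i)
    v.lanes[i] = x;
  return v;
}

// The more generic form, built from the simpler pieces:
// series(start, step) = splat(start) + splat(step) * stepvector
template <std::size_t N>
constexpr Vec<N> seriesVector(int64_t start, int64_t step) {
  Vec<N> base = splat<N>(start);
  Vec<N> scale = splat<N>(step);
  Vec<N> steps = stepVector<N>();
  Vec<N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out.lanes[i] = base.lanes[i] + scale.lanes[i] * steps.lanes[i];
  return out;
}
```

For instance, seriesVector<4>(10, 3) yields the lanes 10, 13, 16, 19.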

> 
>> Future Work
>> -----------
>> 
>> Intrinsics cannot currently be used for constant folding. Our downstream
>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>> for good code generation, so we will need to find new ways to recognize and
>> fold these values.
> 
> As above, we could add ConstantVScale and also ConstantStepVector (or
> ConstantIota).  They won't fold to compile-time values but the
> expressions could be simplified.  I haven't really thought through the
> implications of this, just brainstorming ideas.  What does your
> downstream compiler require in terms of constant support?  What kinds of
> queries does it need to do?

It makes things a little easier to pattern match: you just look for a constant to start
with, instead of having to match multiple different forms of vscale or stepvector
multiplied and/or added in each place you're looking for them.

The bigger reason we currently depend on them being constant is that code generation
generally looks at a single block at a time, and there are several expressions using
vscale that we don't want to be generated in one block and passed around in a register,
since many of the load/store addressing forms for instructions will already scale properly.

We've done this downstream by having them be Constants, but if there's a good way
of doing them with intrinsics we'd be fine with that too.

-Graham