[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tue Jun 5 11:46:06 PDT 2018


> On Jun 5, 2018, at 11:25 AM, Graham Hunter <Graham.Hunter at arm.com> wrote:
> 
> Hi David,
> 
> Thanks for taking a look.
> 
>> On 5 Jun 2018, at 16:23, dag at cray.com wrote:
>> 
>> Hi Graham,
>> 
>> Just a few initial comments.
>> 
>> Graham Hunter <Graham.Hunter at arm.com> writes:
>> 
>>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>>> bytes.
>> 
>> "scalable" instead of "scalable x."
> 
> Yep, missed that in the conversion from the old <n x m x ty> format.
> 
>> 
>>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>>> integers (one to indicate unscaled bits, the other for bits that need to be
>>> scaled by the runtime multiple) will be added. For backends that do not need to
>>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>>> only returns unscaled bits will be provided, with a debug assert that the type
>>> isn't scalable.
>> 
>> Can you explain a bit about what the two integers represent?  What's the
>> "unscaled" part for?
> 
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.
> 
>> 
>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale).  I would be surprised to get a pair of integers back.  Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice?  We could add a ConstantVScale or something to make it work.
> 
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
> 
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
> 
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
> 
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
> 
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.
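
A minimal sketch of the (unscaled, scaled) pair described above, for illustration only. The names `SizeExpr`, `totalBits`, and `addSizes` are hypothetical and are not part of the proposal or of any LLVM API; struct padding and alignment are ignored, as in the example sizes quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration type; not an LLVM API.
struct SizeExpr {
  uint64_t UnscaledBits; // exactly this many bits
  uint64_t ScaledBits;   // this many bits multiplied by vscale
};

// Evaluate '(scaled_bits * vscale) + unscaled_bits' once vscale is known.
inline uint64_t totalBits(SizeExpr S, uint64_t VScale) {
  assert(VScale >= 1 && "the runtime multiple is at least one");
  return S.ScaledBits * VScale + S.UnscaledBits;
}

// Combine two member sizes term by term (no padding or alignment modelled).
inline SizeExpr addSizes(SizeExpr A, SizeExpr B) {
  return SizeExpr{A.UnscaledBits + B.UnscaledBits,
                  A.ScaledBits + B.ScaledBits};
}
```

With this sketch, <4 x i32> is `SizeExpr{128, 0}`, <scalable 4 x i32> is `SizeExpr{0, 128}`, and the struct example is `addSizes(SizeExpr{0, 256}, SizeExpr{64, 0})`.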
> 
>> 
>>> Comparing two of these sizes together is straightforward if only unscaled sizes
>>> are used. Comparisons between scaled sizes are also simple when comparing sizes
>>> within a function (or across functions with the inherit flag mentioned in the
>>> changes to the type), but cannot be compared otherwise. If a mix is present,
>>> then any number of unscaled bits will not be considered to have a greater size
>>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>>> will be considered to have a smaller size than a greater number of scaled bits
>>> (since the runtime multiple is at least one).
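
One way to read the comparison rules above is as a conservative "known smaller for every vscale >= 1" test. The sketch below is an assumption about how that could be written, not the RFC's implementation: `SizeExpr` and `knownSmaller` are hypothetical names, both sizes are assumed to share the same vscale, no upper bound on vscale is assumed, and overflow is ignored.

```cpp
#include <cstdint>

// Hypothetical illustration type; not an LLVM API.
struct SizeExpr {
  uint64_t UnscaledBits; // exactly this many bits
  uint64_t ScaledBits;   // this many bits multiplied by vscale
};

// True only if A is smaller than B for every vscale >= 1.
// A's scaled term must not grow faster than B's, and A must already be
// smaller at the minimum multiple (vscale == 1).
inline bool knownSmaller(SizeExpr A, SizeExpr B) {
  return A.ScaledBits <= B.ScaledBits &&
         A.UnscaledBits + A.ScaledBits < B.UnscaledBits + B.ScaledBits;
}
```

Under these assumptions, 64 unscaled bits are known to be smaller than 128 scaled bits, while 256 unscaled bits and 128 scaled bits are not ordered in either direction.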
>> 
>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison.  We have code here that adds ConstantExpr support to
>> ScalarEvolution.  We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
> 
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.
> 
>> 
>>> We have added an experimental `vscale` intrinsic to represent the runtime
>>> multiple. Multiplying the result of this intrinsic by the minimum number of
>>> elements in a vector gives the total number of elements in a scalable vector.
>> 
>> I think this may be a case where adding a full-fledged Instruction might
>> be worthwhile.  Because vscale is intimately tied to addressing, it
>> seems like things such as ScalarEvolution support will be important.  I
>> don't know what's involved in making intrinsics work with
>> ScalarEvolution but it seems odd that a key component of IR
>> computation would live outside the IR proper, in the sense that all
>> other fundamental addressing operations are Instructions.
> 
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.
+CC Sanjoy to confirm: I think it should be fine to add SCEV support for intrinsics.
> 
>> 
>>> For constants consisting of a sequence of values, an experimental `stepvector`
>>> intrinsic has been added to represent a simple constant of the form
>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>>> start can be added, and changing the step requires multiplying by a splat.
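
To make the lane values concrete, here is a small scalar model of the stepvector description above. It is only a sketch: `stepVector` and `series` are hypothetical helper names, a `std::vector` stands in for a scalable vector, and the element count is taken to be the minimum element count multiplied by vscale.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models the lanes of stepvector: <0, 1, 2, ... num_elems-1>,
// where num_elems = MinElems * VScale.
inline std::vector<int64_t> stepVector(uint64_t MinElems, uint64_t VScale) {
  assert(VScale >= 1 && "the runtime multiple is at least one");
  std::vector<int64_t> Lanes(MinElems * VScale);
  for (std::size_t I = 0; I < Lanes.size(); ++I)
    Lanes[I] = static_cast<int64_t>(I);
  return Lanes;
}

// Models changing the step (multiply by a splat of Step) and then the
// starting value (add a splat of Start).
inline std::vector<int64_t> series(uint64_t MinElems, uint64_t VScale,
                                   int64_t Start, int64_t Step) {
  std::vector<int64_t> Lanes = stepVector(MinElems, VScale);
  for (int64_t &Lane : Lanes)
    Lane = Lane * Step + Start;
  return Lanes;
}
```

For example, with a minimum of 4 elements and vscale equal to 2, `series(4, 2, 10, 3)` models the lanes 10, 13, 16, and so on up to 31.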
>> 
>> This is another case where an Instruction might be better, for the same
>> reasons as with vscale.
>> 
>> Also, "iota" is the name Cray has traditionally used for this operation
>> as it is the mathematical name for the concept.  It's also used by C++
>> and Go and so should be familiar to many people.
> 
> Iota would be fine with me; I forget the reason we didn't go with that initially. We
> also had 'series_vector' in the past, but that was a more generic form with start
> and step parameters instead of requiring additional IR instructions to multiply and
> add for the result as we do for stepvector.
> 
>> 
>>> Future Work
>>> -----------
>>> 
>>> Intrinsics cannot currently be used for constant folding. Our downstream
>>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>>> for good code generation, so we will need to find new ways to recognize and
>>> fold these values.
>> 
>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota).  They won't fold to compile-time values but the
>> expressions could be simplified.  I haven't really thought through the
>> implications of this, just brainstorming ideas.  What does your
>> downstream compiler require in terms of constant support?  What kinds of
>> queries does it need to do?
> 
> It makes things a little easier to pattern match (just looking for a constant to start
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them).
> 
> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.
> 
> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.
> 
> -Graham
> 
>