[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
Amara Emerson via llvm-dev
llvm-dev at lists.llvm.org
Tue Jun 5 11:46:06 PDT 2018
> On Jun 5, 2018, at 11:25 AM, Graham Hunter <Graham.Hunter at arm.com> wrote:
>
> Hi David,
>
> Thanks for taking a look.
>
>> On 5 Jun 2018, at 16:23, dag at cray.com wrote:
>>
>> Hi Graham,
>>
>> Just a few initial comments.
>>
>> Graham Hunter <Graham.Hunter at arm.com> writes:
>>
>>> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>>> bytes.
>>
>> "scalable" instead of "scalable x."
>
> Yep, missed that in the conversion from the old <n x m x ty> format.
>
>>
>>> For derived types, a function (getSizeExpressionInBits) to return a pair of
>>> integers (one to indicate unscaled bits, the other for bits that need to be
>>> scaled by the runtime multiple) will be added. For backends that do not need to
>>> deal with scalable types, another function (getFixedSizeExpressionInBits) that
>>> only returns unscaled bits will be provided, with a debug assert that the type
>>> isn't scalable.
>>
>> Can you explain a bit about what the two integers represent? What's the
>> "unscaled" part for?
>
> 'Unscaled' just means 'exactly this many bits', whereas 'scaled' is 'this many bits
> multiplied by vscale'.
>
>>
>> The name "getSizeExpressionInBits" makes me think that a Value
>> expression will be returned (something like a ConstantExpr that uses
>> vscale). I would be surprised to get a pair of integers back. Do
>> clients actually need constant integer values or would a ConstantExpr
>> suffice? We could add a ConstantVScale or something to make it work.
>
> I agree the name is not ideal and I'm open to suggestions -- I was thinking of the two
> integers representing the known-at-compile-time terms in an expression:
> '(scaled_bits * vscale) + unscaled_bits'.
>
> Assuming the pair is of the form (unscaled, scaled), then for a type with a size known at
> compile time like <4 x i32> the size would be (128, 0).
>
> For a scalable type like <scalable 4 x i32> the size would be (0, 128).
>
> For a struct with, say, a <scalable 32 x i8> and an i64, it would be (64, 256).
>
> When calculating the offset for memory addresses, you just need to multiply the scaled
> part by vscale and add the unscaled as is.
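To make that arithmetic concrete, here is a small Python model of the (unscaled, scaled) pairs; the pair values come from the examples above, while the helper names are purely illustrative and not LLVM API:

```python
# Illustrative model of (unscaled, scaled) size pairs; not LLVM API.

def size_in_bits(unscaled, scaled, vscale):
    """Total size in bits: (scaled * vscale) + unscaled."""
    return scaled * vscale + unscaled

# <4 x i32>: size fully known at compile time.
fixed = (128, 0)
# <scalable 4 x i32>: size scales with the runtime multiple.
scalable = (0, 128)
# Struct of a <scalable 32 x i8> and an i64.
struct = (64, 256)

# With a runtime vscale of 2 (e.g. 256-bit SVE registers):
vscale = 2
print(size_in_bits(*fixed, vscale))     # 128
print(size_in_bits(*scalable, vscale))  # 256
print(size_in_bits(*struct, vscale))    # 576
```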
>
>>
>>> Comparing two of these sizes together is straightforward if only unscaled sizes
>>> are used. Comparisons between scaled sizes are also simple when comparing sizes
>>> within a function (or across functions with the inherit flag mentioned in the
>>> changes to the type), but cannot be compared otherwise. If a mix is present,
>>> then any number of unscaled bits will not be considered to have a greater size
>>> than a smaller number of scaled bits, but a smaller number of unscaled bits
>>> will be considered to have a smaller size than a greater number of scaled bits
>>> (since the runtime multiple is at least one).
>>
>> If we went the ConstantExpr route and added ConstantExpr support to
>> ScalarEvolution, then SCEVs could be compared to do this size
>> comparison. We have code here that adds ConstantExpr support to
>> ScalarEvolution. We just didn't know if anyone else would be interested
>> in it since we added it solely for our Fortran frontend.
>
> We added a dedicated SCEV expression class for vscale instead; I suspect it works
> either way.
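As a sanity check on the comparison rules quoted above, here is a Python sketch of the resulting partial order over (unscaled, scaled) pairs; the function name is made up, but the logic follows directly from the rule that the runtime multiple is at least one:

```python
def compare_sizes(a, b):
    """Compare two (unscaled, scaled) sizes for every vscale >= 1.

    Returns 'lt', 'gt', 'eq', or 'unknown' when the ordering depends on
    the runtime value of vscale. Illustrative model only, not LLVM API.
    """
    (ua, sa), (ub, sb) = a, b
    if a == b:
        return 'eq'
    # a < b for every vscale >= 1 iff a's scaled part is no larger and
    # a is strictly smaller at the minimum (vscale == 1).
    if sa <= sb and ua + sa < ub + sb:
        return 'lt'
    if sb <= sa and ub + sb < ua + sa:
        return 'gt'
    return 'unknown'

print(compare_sizes((64, 0), (0, 128)))   # 'lt': 64 < 128 * vscale always
print(compare_sizes((256, 0), (0, 128)))  # 'unknown': depends on vscale
```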
>
>>
>>> We have added an experimental `vscale` intrinsic to represent the runtime
>>> multiple. Multiplying the result of this intrinsic by the minimum number of
>>> elements in a vector gives the total number of elements in a scalable vector.
>>
>> I think this may be a case where adding a full-fledged Instruction might
>> be worthwhile. Because vscale is intimately tied to addressing, it
>> seems like things such as ScalarEvolution support will be important. I
>> don't know what's involved in making intrinsics work with
>> ScalarEvolution but it seems strangely odd that a key component of IR
>> computation would live outside the IR proper, in the sense that all
>> other fundamental addressing operations are Instructions.
>
> We've tried it as both an instruction and as a 'Constant', and both work fine with
> ScalarEvolution. I have not yet tried it with the intrinsic.
+CC Sanjoy to confirm: I think it should be fine to add SCEV support for the intrinsics.
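For reference, the intrinsic's semantics are simple to model: the total element count is vscale times the minimum count, which is also why the two types at the top of the thread always occupy the same number of bytes. A quick Python sketch (names illustrative, not the actual intrinsic):

```python
def total_elements(min_elems, vscale):
    """Elements in a scalable vector: vscale times the minimum count."""
    return vscale * min_elems

def size_in_bytes(min_elems, elem_bits, vscale):
    return total_elements(min_elems, vscale) * elem_bits // 8

# <scalable 4 x i32> and <scalable 8 x i16> occupy the same number of
# bytes for any value of vscale.
for vscale in (1, 2, 16):
    assert size_in_bytes(4, 32, vscale) == size_in_bytes(8, 16, vscale)
```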
>
>>
>>> For constants consisting of a sequence of values, an experimental `stepvector`
>>> intrinsic has been added to represent a simple constant of the form
>>> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>>> start can be added, and changing the step requires multiplying by a splat.
>>
>> This is another case where an Instruction might be better, for the same
>> reasons as with vscale.
>>
>> Also, "iota" is the name Cray has traditionally used for this operation
>> as it is the mathematical name for the concept. It's also used by C++
>> and Go and so should be familiar to many people.
>
> Iota would be fine with me; I forget the reason we didn't go with that initially. We
> also had 'series_vector' in the past, but that was a more generic form with start
> and step parameters, instead of requiring additional IR instructions to multiply
> and add to build the result as we do for stepvector.
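The relationship between the two forms can be modeled like this, with plain Python lists standing in for vector values (names are illustrative):

```python
def stepvector(num_elems):
    """Constant <0, 1, 2, ..., num_elems-1> (a.k.a. iota)."""
    return list(range(num_elems))

def series_vector(start, step, num_elems):
    """The older, more generic form: start + i*step in each lane.

    Equivalent to stepvector multiplied by a splat of the step, then
    added to a splat of the start.
    """
    return [start + i * step for i in stepvector(num_elems)]

print(stepvector(4))           # [0, 1, 2, 3]
print(series_vector(2, 3, 4))  # [2, 5, 8, 11]
```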
>
>>
>>> Future Work
>>> -----------
>>>
>>> Intrinsics cannot currently be used for constant folding. Our downstream
>>> compiler (using Constants instead of intrinsics) relies quite heavily on this
>>> for good code generation, so we will need to find new ways to recognize and
>>> fold these values.
>>
>> As above, we could add ConstantVScale and also ConstantStepVector (or
>> ConstantIota). They won't fold to compile-time values but the
>> expressions could be simplified. I haven't really thought through the
>> implications of this, just brainstorming ideas. What does your
>> downstream compiler require in terms of constant support? What kinds of
>> queries does it need to do?
>
> It makes things a little easier to pattern match: you just look for a constant to start,
> instead of having to match multiple different forms of vscale or stepvector multiplied
> and/or added in each place you're looking for them.
>
> The bigger reason we currently depend on them being constant is that code generation
> generally looks at a single block at a time, and there are several expressions using
> vscale that we don't want to be generated in one block and passed around in a register,
> since many of the load/store addressing forms for instructions will already scale properly.
>
> We've done this downstream by having them be Constants, but if there's a good way
> of doing them with intrinsics we'd be fine with that too.
>
> -Graham
>
>