[llvm-dev] [RFC] Supporting ARM's SVE in LLVM

Mon Nov 28 04:02:01 PST 2016

>>An initial attempt to represent scalable vectors might be <n x Ty>.  The problem with this approach is there's no perfect interpretation as to what the following type definitions mean:
>>
>>	<n x i8>
>>	<n x i16>
>>	<n x i32>
>>	<n x i64>
>>
>>[Interpretation 1]
>>
>>A vector of "n" elements of the specified type.  Here "n" is likely to be scaled based on the largest possible element type.  This fits well with the following loop:
>>
>>(1)	for (0..N) { bigger_type[i] += smaller_type[i]; }
>>
>>but becomes inefficient when the largest element type is not required.
>>
>>[Interpretation 2]
>>
>>A vector full of the specified type. Here the isolated meaning of "n" means nothing without an associated element type.  This fits well with the following loop:
>>
>>(2)	for (0..N) { type[i] += type[i]; }

>I'm with Mehdi on this... these examples don't look problematic. You
>have shown what the different constructs would be good at, but I still
>can't see where they won't be.

I'll apply the loops to their opposite interpretation assuming bigger_type=i64, smaller_type=type=i8:

[Interpretation 1]

(2) for (0..N) { bytes[i] += other_bytes[i]; } ====> <n x i8> += <n x i8>
(2) for (0..N) { int64s[i] += other_int64s[i]; } ====> <n x i64> += <n x i64>

Because this interpretation requires "n" to be the same for all scalable vectors, the int64 loop clearly involves vectors that are 8x bigger than those in the byte loop.  Structurally this is fine from the IR's point of view, but in hardware both loops operate on vectors of the same length.  The code generator will either split the int64 loop's instructions, thus planting 8 adds, or promote the byte loop's instructions, thus only utilising an 8th of the lanes.
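
To put rough numbers on that trade-off, here is a small Python sketch.  It is only an illustration: the function name, its parameters and the 128-bit register width are my assumptions, not anything in LLVM or the SVE code generator.

```python
def interp1_cost(n, elem_bits, reg_bits):
    """Interpretation 1: every scalable vector shares one element count n.

    Returns (registers_needed, lane_utilisation) for a single <n x iN>
    value, given a hypothetical hardware register of reg_bits bits.
    """
    vec_bits = n * elem_bits
    regs = -(-vec_bits // reg_bits)          # ceiling division
    utilisation = vec_bits / (regs * reg_bits)
    return regs, utilisation


REG_BITS = 128  # assumed register width, purely for illustration

# "n" chosen so that <n x i8> fills one register: the i64 add is split.
n_for_bytes = REG_BITS // 8
split_case = interp1_cost(n_for_bytes, 64, REG_BITS)

# "n" chosen so that <n x i64> fills one register: the i8 add wastes lanes.
n_for_int64s = REG_BITS // 64
promote_case = interp1_cost(n_for_int64s, 8, REG_BITS)
```

With these assumed values, split_case needs 8 registers (hence 8 adds) and promote_case uses a single register with only an 8th of its bits occupied.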

[Interpretation 2]

(1)	for (0..N) { int64s[i] += bytes[i]; }  ==> <n x i64> += zext <????? x i8> as <n x i64>

This interpretation falls down at the IR level.  If <n x i8> represents a vector full of bytes, how do you represent a vector that's an 8th full of bytes, ready to be zero-extended?
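
The mismatch can be sketched in a few lines of Python.  Again, the helper names and the 128-bit register width are assumptions made for illustration; the only real rule relied on is that zext needs the same element count on both sides.

```python
def interp2_lanes(elem_bits, reg_bits):
    """Interpretation 2: a scalable vector is always "full", so its
    element count is dictated by the element type."""
    return reg_bits // elem_bits


def can_zext(src_bits, dst_bits, reg_bits):
    """zext is element-wise, so source and destination vectors must
    have the same number of elements."""
    return interp2_lanes(src_bits, reg_bits) == interp2_lanes(dst_bits, reg_bits)


REG_BITS = 128  # assumed register width, purely for illustration

byte_lanes = interp2_lanes(8, REG_BITS)    # elements in <n x i8>
int64_lanes = interp2_lanes(64, REG_BITS)  # elements in <n x i64>
mixed_loop_ok = can_zext(8, 64, REG_BITS)  # loop (1): i8 -> i64
```

Under these assumptions <n x i8> holds 16 elements while <n x i64> holds 2, so mixed_loop_ok is False: there is no type left to name the 2-element byte vector that loop (1) needs.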

>I originally thought that the extended version "<n x m x Ty>" was
>required because SVE needs all vector lengths to be a multiple of
>128-bits, so they'd be just "glorified" NEON vectors. Without it,
>there is no way to make sure it will be a multiple.

Surely this is true of most vector architectures, hence the reason for costing vectors across a range of element counts to determine which the code generator likes best.  Scalable vectors are no different: SVE's cost model prefers scalable vectors whose statically known length component (i.e. "M x sizeof(Ty)") is 128 bits, because they better match the way the code generator models SVE registers.
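
As a rough sketch of how the "<n x m x Ty>" form avoids both problems above, the Python below computes the static component for a few types.  The function names and the sample value of "n" are hypothetical, and using <n x 2 x i8> for the partially filled byte vector is my reading of the proposal rather than something stated in this mail.

```python
def static_bits(m, elem_bits):
    """Statically known length component of <n x m x Ty>, in bits."""
    return m * elem_bits


def preferred_m(elem_bits, granule_bits=128):
    """The "m" whose static component is exactly one 128-bit granule."""
    return granule_bits // elem_bits


def total_elements(n, m):
    """Element count once the runtime multiple "n" is known."""
    return n * m


m_i8 = preferred_m(8)    # <n x 16 x i8>: a vector full of bytes
m_i64 = preferred_m(64)  # <n x 2 x i64>: a vector full of int64s

# Loop (1) can use <n x 2 x i8>: same element count as <n x 2 x i64>,
# but only an 8th of a 128-bit granule, ready to be zero-extended.
narrow_bits = static_bits(m_i64, 8)

# With a hypothetical runtime n of 4, both operands have 8 elements.
elems = total_elements(4, m_i64)
```

Here "n" keeps a single meaning across all types, while "m" carries the per-type element count that the <n x Ty> form could not express.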

Paul!!!