[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Robin Kruppe via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 1 13:28:14 PDT 2018


On 31 July 2018 at 23:32, David A. Greene <dag at cray.com> wrote:
> Robin Kruppe <robin.kruppe at gmail.com> writes:
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function" but the same principle applies, it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
>
> I think so.  If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.
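
For readers less familiar with the RISC-V model: the per-loop configuration described above is closely related to RVV-style strip-mining, where software asks the hardware for a vector length before processing each chunk. Here is a minimal scalar C sketch of that control flow; the `set_vector_length` helper and `VLMAX` constant are invented stand-ins for the hardware mechanism, not real intrinsics:

```c
#include <stddef.h>

#define VLMAX 8  /* hypothetical register width in elements for one configuration */

/* Stand-in for a vsetvl-style request: the hardware grants up to VLMAX
 * elements, so the loop never reads or writes past the end of the arrays. */
static size_t set_vector_length(size_t remaining) {
    return remaining < VLMAX ? remaining : VLMAX;
}

void vadd(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; ) {
        /* Configure the unit for this chunk; vl may shrink on the last pass. */
        size_t vl = set_vector_length(n - i);
        /* This scalar inner loop stands in for a single vector instruction
         * operating on vl elements at once. */
        for (size_t j = 0; j < vl; ++j)
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

The point is that the granted length (and hence vscale) depends on how the unit was configured before the loop, which is why a single configuration per function is the proposed compromise.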

Yeah, having many loops with different demands on the vector unit in
one function is a problem for the "one vscale per function" approach.
Though for the record, the differences that matter here are not trip
count, but things like register pressure and the bit widths of the
vector elements.

There are some (fragile) workarounds for this problem, such as
splitting up the function. There's also the possibility of optimizing
for this case in the backend: trying to recognize when you can use
different configurations/vscales for two loops without changing
observable behavior (no vector values live between the loops, vscale
doesn't escape, etc.). In general this is of course extremely
difficult, but I hope it'll work well enough in practice to mitigate
this problem somewhat. This is just an educated guess at this point,
we'll have to wait and see how big the impact is on real applications
and real hardware (or simulations thereof).
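
To make the "no vector values live between the loops" condition a bit more concrete, here is a hypothetical C sketch; the `configure_*` helpers are invented placeholders for whatever per-loop vector-unit setup (e.g. a vsetvl with different element-width settings) the backend might emit, not an actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholders: each returns the vscale a given
 * configuration would yield on some imagined hardware. */
static size_t configure_for_32bit(void) { return 8; }  /* pretend vscale A */
static size_t configure_for_64bit(void) { return 4; }  /* pretend vscale B */

void two_independent_loops(int32_t *x, size_t nx, int64_t *y, size_t ny) {
    /* Loop 1: 32-bit elements. No vector value produced here outlives
     * the loop; only scalar data in memory crosses the boundary. */
    size_t vl1 = configure_for_32bit();
    (void)vl1;
    for (size_t i = 0; i < nx; ++i)
        x[i] *= 2;

    /* Loop 2: 64-bit elements. Because nothing vector-valued is live
     * across the boundary above and vscale does not escape, a backend
     * could legally reconfigure the unit with a different vscale here. */
    size_t vl2 = configure_for_64bit();
    (void)vl2;
    for (size_t j = 0; j < ny; ++j)
        y[j] += 1;
}
```

If, by contrast, a vector register value flowed from the first loop into the second, the two loops would have to share one configuration, which is exactly the situation the recognition pass would have to rule out.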

But at the end of the day, sure, maybe we'll generate sub-optimal code
for some applications. That's still better than making the problem
intractable by being too greedy and ending up with either a broken
compiler or one that can't vary vscale at all.


Cheers,
Robin

