[llvm-dev] Scalable Vector Types in IR - Next Steps?

Wed Mar 20 17:56:14 PDT 2019

On Tue, Mar 19, 2019 at 12:32 PM Chandler Carruth via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> However, the more I talk with and work with my users doing SIMD programming (and my entire experience doing it personally) leads to me to believe this will be of extremely limited utility to model in the IR. There will be a small number of places where it can be used. All of those where performance matters will end up being tuned for *specific* widths anyways to get the last few % of performance. Those that aren't performance critical won't provide any substantial advantage over just being 128-bit vectorized or left scalar. At that point, we pay the complexity and maintenance cost of this completely special type in the IR for no material benefit.

To me, this is nothing like SIMD programming. I've done that, with
VMX/Altivec and NEON.

I've been working with a number of kernels implemented on RISC-V
vectors recently. At least for the things we've been looking at so
far, the code is almost exactly the same as you'd use to implement the
same algorithm (possibly pipelined, unrolled etc) using 32 normal FP
registers, it's just that you work on some unknown-at-compile-time
number of different outer-loop iterations in parallel. For example,
maybe you've got a whole lot of 3x3 matrices to invert. You load each
element of the first matrix into nine registers, then calculate the
determinant, then permute the input values into their new positions
while dividing them by the determinant, and write them all out. It's
exactly the same with the vector ISA, except you might be loading and
working on 1, 2, 4, ... 1000 of the matrices in parallel. You just
don't know, and it doesn't matter. The same for sgemm. You work on
strips eight (say) wide/high. In one dimension you have normal
loads/stores, and in the other dimension you have strided
loads/stores. You're working on rectangular blocks 8 high/wide and
some unknown-at-compile-time amount wide/high -- on some small
machine it might be 1 (i.e. basically a standard FP register file, but
the vector ISA works on it correctly), but presumably on most it will
be something like 4 or 8 or 16 elements. If you unroll either of these
kernels once (or software pipeline it) then you're going to pretty
much saturate your memory system or your fma units or both, depending
on the particular kernel's ratio of compute-to-bytes, how many
functional units you have, and the width of your memory bus.
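
To make the 3x3 example concrete, here is a minimal scalar C sketch of
that loop shape. It is a model, not RISC-V code: model_vl() and
MODEL_VLMAX are hypothetical stand-ins for whatever vector length the
hardware grants, and each inner "for k" loop stands in for a single
vector instruction. It assumes the matrices are stored back to back in
row-major order (so each load is a stride-9 load) and uses the standard
adjugate formula for the inverse.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: pretend the hardware handles at most 4 matrices per pass.
   Real vector-length-agnostic code never sees this constant. */
#define MODEL_VLMAX 4

/* Hypothetical model of the vector-length request: grant as many
   elements as remain, capped at the modelled hardware maximum. */
static size_t model_vl(size_t remaining) {
    return remaining < MODEL_VLMAX ? remaining : MODEL_VLMAX;
}

/* Invert n row-major 3x3 matrices stored contiguously (9 floats each).
   Matrices are assumed to be non-singular. */
void invert3x3_batch(size_t n, const float *in, float *out) {
    while (n > 0) {
        size_t vl = model_vl(n);
        assert(vl > 0 && vl <= n);

        /* Nine "vector registers": m[e][k] is element e of matrix k. */
        float m[9][MODEL_VLMAX];
        for (size_t e = 0; e < 9; e++)
            for (size_t k = 0; k < vl; k++)   /* one strided load */
                m[e][k] = in[k * 9 + e];

        for (size_t k = 0; k < vl; k++) {     /* element-wise vector ops */
            float a = m[0][k], b = m[1][k], c = m[2][k];
            float d = m[3][k], e = m[4][k], f = m[5][k];
            float g = m[6][k], h = m[7][k], i = m[8][k];
            float det = a * (e * i - f * h)
                      - b * (d * i - f * g)
                      + c * (d * h - e * g);
            float *o = out + k * 9;           /* strided stores */
            o[0] = (e * i - f * h) / det;
            o[1] = (c * h - b * i) / det;
            o[2] = (b * f - c * e) / det;
            o[3] = (f * g - d * i) / det;
            o[4] = (a * i - c * g) / det;
            o[5] = (c * d - a * f) / det;
            o[6] = (d * h - e * g) / det;
            o[7] = (b * g - a * h) / det;
            o[8] = (a * e - b * d) / det;
        }

        /* Advance by however many matrices this pass handled. */
        in += vl * 9;
        out += vl * 9;
        n -= vl;
    }
}
```

Changing MODEL_VLMAX to 1 or 16 changes how many passes the outer loop
makes but not the body, which is the point being made above.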

Maybe you're right and hand-tuned SIMD code with explicit knowledge of
the vector length might get you single-digit percentage better
performance, but it probably won't be more than that and it's a lot of
work.

As for LLVM IR support .. I don't have a firm opinion on whether this
scalable type proposal is sufficient, insufficient, or overkill.

My own gut feeling is that the existing type system is fine for
describing vector data in memory, and that all we need (at least for
RISC-V) is a new register file that is very similar to any machine
with a unified int/fp register file. LLVM needs to manage register
allocation in this register file just as it does for regular int or fp
register files. Spills and reloads of these registers would be
undesirable, but if they are needed then the compiler would have to
allocate the space for this using alloca (or malloc).

The biggest thing needed I think is understanding one unusual
instruction: vsetvl{i}. At the head of each loop you explicitly use
the vsetvl{i} instruction to set the register width (the vector
element width) to something between 8 bits and 1024 bits. The vsetvl
instruction returns an integer which you normally use only to scale by
the element width that you just set, and use the result to bump your
input and output pointers by N elements instead of 1 element.
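
A minimal C sketch of that loop structure, with the vector hardware
modelled in scalar code. model_vsetvl() and HW_MAX_ELEMS are
hypothetical names invented for illustration; they are not the real
instruction or any real intrinsic, and the sketch only assumes that the
returned count is at least 1 and at most the number of elements
requested. In C the scaling by element width is implicit in pointer
arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the modelled machine handles up to 4 elements per pass.
   The loop below is written so that it never depends on this value. */
#define HW_MAX_ELEMS 4

/* Hypothetical stand-in for vsetvl: given the number of elements still
   to process, return how many will be handled in this iteration. */
static size_t model_vsetvl(size_t remaining) {
    return remaining < HW_MAX_ELEMS ? remaining : HW_MAX_ELEMS;
}

/* y[i] += a * x[i] for i in [0, n), processed in strips. */
void saxpy_strips(size_t n, float a, const float *x, float *y) {
    while (n > 0) {
        size_t vl = model_vsetvl(n);      /* head of each loop iteration */
        assert(vl > 0 && vl <= n);

        for (size_t i = 0; i < vl; i++)   /* stands in for vector ops */
            y[i] += a * x[i];

        /* Bump the pointers by vl elements instead of 1 element. */
        x += vl;
        y += vl;
        n -= vl;
    }
}
```

The returned count is used only for pointer and trip-count bookkeeping,
so the same source works whether the machine grants 1 element per pass
or many.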

So, you kind of need a new type for the registers, but it's purely for
the registers. Not only can you not include it in arrays or structs,
you also can't load it from memory or store it to memory.

The plan for RISC-V is also that all 32 vector registers will be
caller-save/volatile. If you call a function then when it returns you
have to assume that all vector registers have been trashed. There are
no functions using the standard ABI that take vector registers as
arguments or return vector registers as results. The only apparent
exception is the compiler's runtime library that will have things the
compiler explicitly knows about such as transcendental functions --
but they don't use the standard ABI.