[llvm-dev] vectorisation, risc-v

Sun Aug 5 23:12:33 PDT 2018

(Please CC me to preserve threading, as I am subscribed in digest mode.)

Hi folks, I have a requirement to develop a libre-licensed, low-power
embedded 3D GPU and VPU, and using RISC-V as the basis (GPGPU style) seems
eminently sensible. Anyone is invited to participate. A GSoC 2017
student named Jake has already developed a Vulkan 3D software renderer and
shader, and (parallelised) LLVM is a critical dependency for achieving the
high efficiency needed. The difference when compared to gallium3d-llvmpipe
is that Jake's software renderer also uses LLVM for the shader, where
gallium3d-llvmpipe does not.

I have reviewed the LLVM RV RFC and it looks very sensible, informative,
and well thought through. Keeping VL changes restricted to function call
boundaries is a very good idea (presumably "fake" function calls can be
considered, as a way to break up large functions safely), and the intrinsic
vector length, i.e. passing in the vector length effectively as an
additional hidden function parameter, is also very sensible.
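
To make the "hidden parameter" idea concrete, here is a minimal conceptual
sketch in Python. It is not LLVM IR and not taken from the RFC: the names
`saxpy_chunk`, `saxpy` and `max_vl` are made up for illustration. The only
point it shows is that VL is fixed for the whole body of a call and changes
only at the call boundary, so splitting a large loop into calls gives safe
places to change VL.

```python
def saxpy_chunk(a, x, y, vl):
    """Body of one call: vl is the (normally hidden) vector-length argument.

    vl stays constant for the whole body of this function.
    """
    return [a * x[i] + y[i] for i in range(vl)]


def saxpy(a, x, y, max_vl):
    """Driver: VL is only ever changed here, at the call boundary."""
    out = []
    for start in range(0, len(x), max_vl):
        vl = min(max_vl, len(x) - start)  # new VL chosen between calls
        out.extend(saxpy_chunk(a, x[start:start + vl], y[start:start + vl], vl))
    return out
```

Each call to `saxpy_chunk` plays the role of one of the "fake" function
calls mentioned above.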

I also liked that the RFC made it clear that LLVM is divided into two
parts, which I had suspected but had not had confirmed.

As an aside, I have to say that I am extremely surprised to learn that it is
only in the past year that vectorisation, or more specifically
variable-length SIMD, has hit the mainstream in libre-licensed toolchains,
through ARM and RISC-V.

So, some background: I am the author of the SimpleV extension, which has
been developed to provide a uniform *parallelism* API, *not* as a new
Vector Microarchitecture (a common misconception). It has unintended
side effects such as providing LD/ST multi with predication, which in turn
can be used on function entry or context switch to save or load *up to* the
entire register file with around three instructions. Another unintended
side effect is code size reduction.

There is a total of ZERO new RISC-V instructions: the entire design is based
around CSRs that implicitly mark the STANDARD registers as "vectorised",
and that also provide a redirection table which can arbitrarily redirect
the 32 registers to 64 REAL registers (64 real FP and 64 real int). This
includes empowering Compressed instructions to access the full 64
registers, even when the C instruction is restricted to x8-x15. Predication
is similarly done via CSR redirection/lookups.
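
A rough behavioural sketch of the above, in Python, may help. It is a toy
model under stated assumptions, not the SV specification: the names
`RegEntry`, `make_table` and `vectorised_add` are hypothetical, the iteration
count is assumed to be taken from the destination register's "vectorised"
flag, and a scalar source is assumed to be reused for every element.

```python
from dataclasses import dataclass

NUM_REAL_REGS = 64  # 64 real registers, as described above


@dataclass
class RegEntry:
    real: int        # index into the real register file
    is_vector: bool  # True if the register is marked "vectorised"


def make_table():
    """Identity mapping, everything scalar: behaves like plain RISC-V."""
    return {n: RegEntry(real=n, is_vector=False) for n in range(32)}


def vectorised_add(regs, table, rd, rs1, rs2, vl):
    """Model of an unmodified scalar ADD executed under the table.

    Assumption: the op repeats vl times if rd is marked vectorised,
    stepping through contiguous real registers; otherwise it runs once.
    """
    d, a, b = table[rd], table[rs1], table[rs2]
    count = vl if d.is_vector else 1
    for i in range(count):
        src_a = regs[a.real + (i if a.is_vector else 0)]
        src_b = regs[b.real + (i if b.is_vector else 0)]
        regs[d.real + i] = src_a + src_b
```

For example, marking x8 and x9 as vectors redirected to real registers 32
and 40 lets a single `ADD x8, x9, x10` operate on four real registers at
once, with x10 acting as a scalar operand.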

SETVL is slightly different from RV in that it requires an immediate length
as an additional parameter. This is because the Maximum Vector Length is no
longer hardcoded into silicon: it instead specifies exactly how *many*
contiguous registers in the standard regfile need to be used, NOT how many
are in a totally different regfile and NOT the width of the SIMD / Vector
Lane(s).
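
As a minimal sketch of what this means for a loop (Python, illustrative
only): `setvl` and `strip_mine` are hypothetical names, and clamping the
requested length to the immediate maximum with `min` is an assumption
carried over from the usual strip-mining convention rather than something
quoted from the SV spec.

```python
def setvl(requested, mvl):
    """Return the granted VL: the request, clamped to the immediate MVL.

    mvl models the immediate operand, i.e. how many contiguous standard
    registers the vector is allowed to occupy.
    """
    if mvl < 1:
        raise ValueError("mvl must be at least 1")
    return min(requested, mvl)


def strip_mine(n, mvl):
    """Yield (start, vl) pairs covering n elements, at most mvl at a time."""
    start = 0
    while start < n:
        vl = setvl(n - start, mvl)
        yield start, vl
        start += vl
```

With `n = 10` and an immediate of 4, the loop body would run with VL values
of 4, 4 and 2.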

So with that as background, I have some questions.

1. Given the separation between the LLVM front end and back end, it looks
like adding experimental SV support would be a simple matter of doing the
backend assembly code translator, with little to no modification of the
front end needed. Would that be about right? Particularly if LLVM-RV
already adds a variable length concept.

2. With there being absolutely no new instructions whatsoever (standard
existing AND FUTURE scalar ops are instead made implicitly parallel), and
given the deliberate design similarities, it seems to me that SV would be a
good first experimental backend *ahead* of RVV, for which the 240+ opcodes
have not yet been finalised. Would people concur?

3. If there are existing patches, where can they be found?

4. From Jeff Bush's Nyuzi work it has been noted that certain 3D operations
are just far too expensive to do as SIMD or vectors. Conversion of multiple
FP ARGB values to 24/32-bit, with direct overlay (with transparency) into a
tile, is therefore, for example, a high-priority candidate for a special
opcode that must be called explicitly. Is this relatively easy to do, and
is there documentation explaining how?

5. Although it is far too early to discuss optimisations, I did have an idea
that may benefit RVV, SV and ARM vectors, jumpstarting them to the sorts of
speeds associated with SIMD. SV has the concept of being able to mark
register sequences (aka vectors) as "packed SIMD y/n", including overriding
a standard opcode's default width, and including predication, but on the
PACKED width, NOT the element width. Thus it would seem logical to reflect
this by extending the basic data types to vectorlen x simdwidth x
datatype, as opposed to just vectorlen x datatype as the RFC currently
stands. In doing so, *all* of the vectorisation systems could simply
vectorise (and leverage) the *existing* proven SIMD patterns that have
taken years to establish. To illustrate: if the loop length is divisible by
two, an instruction of type VL x 2 x 32bitint would be issued and the SIMD
pattern for 2x32bitint could be deployed, including predication down to the
2x32bitint level if desired, and yet there would be no loop cleanup.
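
The illustration in (5) can be modelled roughly as follows, in Python. This
is a sketch under assumptions, not a proposed IR design: `packed_add` and
`vector_of_packed_add` are hypothetical names, lanes are assumed to wrap at
32 bits, and a masked-off packed group is assumed to leave the destination
unchanged.

```python
def packed_add(a_group, b_group):
    """Stand-in for an existing packed-SIMD pattern (e.g. 2x32bitint add)."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a_group, b_group)]


def vector_of_packed_add(a, b, dest, simd_width=2, predicate=None):
    """Apply the packed pattern VL times, i.e. VL x simd_width x 32bitint.

    The predicate has one bit per PACKED group, not per element.
    """
    if len(a) != len(b) or len(a) != len(dest):
        raise ValueError("operands must have equal length")
    if len(a) % simd_width != 0:
        raise ValueError("loop length must be divisible by simd_width")
    vl = len(a) // simd_width
    if predicate is None:
        predicate = [True] * vl
    out = list(dest)
    for g in range(vl):
        if predicate[g]:
            lo, hi = g * simd_width, (g + 1) * simd_width
            out[lo:hi] = packed_add(a[lo:hi], b[lo:hi])
    return out
```

Because the loop length is a whole number of packed groups, the loop over
`g` covers every element and no scalar cleanup loop is needed.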

It is worth emphasising that this will not be a private, proprietary hard
fork of LLVM: it is an entirely libre effort, including the GPGPU. (I read
Alex's lowRISC posts on such private forking practices; a hard fork would
be just insane and hugely counterproductive.) So, particularly with regard
to (4), documentation, guidelines and recommendations likely to help the
upstreaming process go smoothly would also be greatly appreciated.

Many thanks,

L.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68