[llvm-dev] [RFC] Vector Predication

Jacob Lifshay via llvm-dev llvm-dev at lists.llvm.org
Fri Feb 1 03:45:34 PST 2019

On Fri, Feb 1, 2019 at 2:59 AM Bruce Hoult <brucehoult at sifive.com> wrote:

> On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com>
> wrote:
> > Neat! I did not know that about the V extension. So this sounds as
> > though the V extension would like support for <VL x <4 x float>>-style
> > vectors as well.
>
> Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
> N could be as small as 1 though support for smaller than i8 is
> optional. (no distinction is drawn between int and float in the vector
> configuration -- that's up to the operations performed)
>
> > We are currently thinking of defining the extension in terms of a 16-bit
> > prefix that changes standard 32-bit instructions into vectorized 48-bit
> > instructions, allowing most future or current standard/non-standard
> > extensions (such as the B extension) to be vectorized, rather than
> > having to wait for vector versions of them to be added to the V
> > extension (one reason we are not using the V extension instead).
>
> Do you mean instructions following the standard 48-bit encoding
> scheme, that happen to contain a standard 32 bit instruction as a
> payload?

Yes. We reuse the 2 LSBs of the 32-bit instruction (they are constant, so
they do not need to be stored in the payload) to allow for more prefix
bits. An example prefix scheme (that took the complexity waaay too far,
we're working on that):
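
(The scheme referred to above is not included in this message. Purely to illustrate the bit-reuse idea, and not as the actual proposal, here is a minimal Python sketch. It relies only on two facts from the base RISC-V encoding: every 32-bit instruction has 0b11 in its low 2 bits, and every 48-bit instruction has 0b011111 in its low 6 bits. The field positions, the 12-bit prefix width, and the names wrap/unwrap are my own assumptions for the example.)

```python
LEN32_MARKER = 0b11       # low 2 bits of every standard 32-bit RISC-V instruction
LEN48_MARKER = 0b011111   # low 6 bits of every standard 48-bit RISC-V instruction

PAYLOAD_BITS = 32 - 2                   # 32-bit instruction minus its constant bits
PREFIX_BITS = 48 - 6 - PAYLOAD_BITS     # what is left over for the prefix

def wrap(instr32, prefix):
    """Pack a 32-bit instruction and a prefix into one 48-bit word.

    Hypothetical layout (an assumption, not the real scheme):
      bits [5:0]   48-bit length marker
      bits [17:6]  prefix
      bits [47:18] 32-bit instruction with its 2 constant LSBs dropped
    """
    if not 0 <= instr32 < (1 << 32):
        raise ValueError("not a 32-bit value")
    if instr32 & 0b11 != LEN32_MARKER:
        raise ValueError("not a standard 32-bit instruction")
    if not 0 <= prefix < (1 << PREFIX_BITS):
        raise ValueError("prefix does not fit")
    payload = instr32 >> 2
    return (payload << 18) | (prefix << 6) | LEN48_MARKER

def unwrap(instr48):
    """Recover (instr32, prefix) from a word produced by wrap()."""
    if instr48 & 0b111111 != LEN48_MARKER:
        raise ValueError("not a 48-bit instruction")
    prefix = (instr48 >> 6) & ((1 << PREFIX_BITS) - 1)
    payload = instr48 >> 18
    # Re-attach the constant bits that were dropped when packing.
    return (payload << 2) | LEN32_MARKER, prefix

ADD_A0_A1_A2 = 0x00C58533   # add a0, a1, a2
wrapped = wrap(ADD_A0_A1_A2, 0xABC)
```

Dropping the two constant bits is what turns a 10-bit leftover into a 12-bit one in this layout; unwrap() shows that nothing is lost by doing so.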

> > Having a prefix rather than, or in addition to, a layout configuration
> > register allows intermixing vector operations on different group/element
> > sizes without having to constantly change the vector configuration every
> > few instructions.
>
> No real difference. The standard RISC-V Vector extension is intended
> to allow exactly those changes to the vector configuration every few
> instructions. It's mostly the microcontroller people coming from
> DSP/SIMD who want to do that, so it's up to them to make that
> efficient on their cores -- they might even do macro-op fusion on it.

Yeah, that works, but you need a larger instruction fetch bandwidth, since
every configuration change is an extra instruction to fetch.

> Big OoO/Supercomputer style code compiled from C/FORTRAN in general
> doesn't want to do that kind of thing.

We're aiming for SIMT-style code (Vulkan Shaders) converted into
variable-length vector operations, so it's different than either
microcontroller or supercomputer styles.

Before vectorization, short vectors are used to represent:
- colors (RGBA)
- positions (XYZ)
- geometric vectors (XYZ)
- transformation matrices (4x4 or 4x3/3x4)
- positions in homogeneous coordinates (XYZW)
- and more.

The short vectors are used more as a grouping mechanism (like a struct or
class) than just as a method of improving performance.

One problem with the V extension in this use case is that 3-element vectors
(pre-vectorization) are quite common. With M limited to {2,4,8}, each one
has to be padded to 4 elements, which leaves one lane in four idle. If there
were a mechanism to natively support them, we could pack them tightly in
registers and ALUs, preventing a 25% performance loss.
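
To spell out where the 25% comes from, here is a small Python sketch of the arithmetic. The function names are mine, and it assumes the only available group sizes are the {2,4,8} mentioned above and that every lane costs the same.

```python
def padded_group_size(elements, allowed=(2, 4, 8)):
    """Smallest allowed group size that can hold `elements` values."""
    return min(m for m in allowed if m >= elements)

def lane_utilization(elements, lanes):
    """Fraction of the lanes in one group that carry useful data."""
    if not 0 < elements <= lanes:
        raise ValueError("elements must fit in the group")
    return elements / lanes

vec3_padded = lane_utilization(3, padded_group_size(3))  # vec3 stored in a 4-wide group
vec3_native = lane_utilization(3, 3)                     # vec3 stored in a 3-wide group
wasted = 1 - vec3_padded                                 # lanes doing no useful work
```

With a native 3-wide group the utilization is 1.0; padded to 4 it is 0.75, so a quarter of the register space and ALU lanes does nothing useful.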

An example:
Relevant section reproduced for convenience:
struct VertexIn
{
    vec3 position;
    vec3 normal;
    vec4 color; // rgba
};
struct VertexOut
{
    vec4 position; // xyzw
    vec4 color;
};
VertexIn vertexes_in[];
VertexOut vertexes_out[];
vec3 light_dir;
float ambient, diffuse;
for(int i = 0; i < 1000; i++)
{
    // calculate vertex colors using
    // lambert's cos model and fixed ambient brightness
    vec3 n = vertexes_in[i].normal;
    vec3 l = light_dir;
    float dot = n.x * l.x + n.y * l.y + n.z * l.z;
    float brightness = max(dot, 0.0) * diffuse + ambient;
    vec4 c = vertexes_in[i].color;
    c.rgb *= brightness;
    vertexes_out[i].color = c;
    // orthographic projection
    vertexes_out[i].position = vec4(vertexes_in[i].position, 1.0);
}

Vectorization produces:
for(int i = 0; i < 1000;)
{
    // process up to VL vertices per iteration
    VL = setvl(1000 - i);
    vec3xVL n = load3xVL_strided(&vertexes_in[i].normal, sizeof(VertexIn));
    vec3 l = light_dir;
    vecVL dot = n.x * l.x + n.y * l.y + n.z * l.z;
    vecVL brightness = max(dot, 0.0) * diffuse + ambient;
    vec4xVL c = load4xVL_strided(&vertexes_in[i].color, sizeof(VertexIn));
    vec3xVL c_rgb = c.rgb;
    c_rgb *= brightness;
    c.rgb = c_rgb;
    store4xVL_strided(&vertexes_out[i].color, c, sizeof(VertexOut));
    vec4xVL p = 1.0; // splat, so that w = 1.0
    p.xyz = load3xVL_strided(&vertexes_in[i].position, sizeof(VertexIn));
    store4xVL_strided(&vertexes_out[i].position, p, sizeof(VertexOut));
    i += VL;
}
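
As a sanity check on the transformation, here is a rough Python emulation of both loops, color computation only. It is a sketch under assumptions: MAXVL = 8 is an arbitrary stand-in for whatever the hardware grants, setvl() is modelled as min(requested, MAXVL), and list slices stand in for the strided loads and stores.

```python
MAXVL = 8  # assumption: hardware maximum vector length (implementation-defined in practice)

def setvl(requested):
    """Hypothetical stand-in for setvl: grant at most MAXVL elements."""
    return max(0, min(requested, MAXVL))

def brightness_of(n, l, ambient, diffuse):
    # lambert's cos model and fixed ambient brightness
    dot = n[0] * l[0] + n[1] * l[1] + n[2] * l[2]
    return max(dot, 0.0) * diffuse + ambient

def shade_scalar(normals, colors, light_dir, ambient, diffuse):
    """One vertex per iteration, as in the original loop."""
    out = []
    for n, c in zip(normals, colors):
        b = brightness_of(n, light_dir, ambient, diffuse)
        out.append((c[0] * b, c[1] * b, c[2] * b, c[3]))
    return out

def shade_strip_mined(normals, colors, light_dir, ambient, diffuse):
    """Up to VL vertices per iteration, as in the vectorized loop."""
    out, strips, i = [], 0, 0
    total = len(normals)
    while i < total:
        vl = setvl(total - i)
        n = normals[i:i + vl]   # stands in for load3xVL_strided
        c = colors[i:i + vl]    # stands in for load4xVL_strided
        b = [brightness_of(nk, light_dir, ambient, diffuse) for nk in n]
        out.extend((ck[0] * bk, ck[1] * bk, ck[2] * bk, ck[3])
                   for ck, bk in zip(c, b))
        strips += 1
        i += vl
    return out, strips

# Deterministic test data: normals alternate between facing and not facing the light.
normals = [(0.0, 0.6, 0.8 if k % 2 else -0.8) for k in range(1000)]
colors = [(k / 1000.0, 0.5, 1.0, 1.0) for k in range(1000)]
light = (0.0, 0.0, 1.0)

scalar = shade_scalar(normals, colors, light, 0.1, 0.9)
vector, strips = shade_strip_mined(normals, colors, light, 0.1, 0.9)
```

Both versions perform the same floating-point operations in the same order per vertex, so the results match exactly; with MAXVL = 8 the strip-mined loop runs 125 times instead of 1000.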

Jacob Lifshay