<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Feb 1, 2019 at 2:59 AM Bruce Hoult <<a href="mailto:brucehoult@sifive.com">brucehoult@sifive.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <<a href="mailto:programmerjake@gmail.com" target="_blank">programmerjake@gmail.com</a>> wrote:<br>

> Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.<br>

<br>

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and<br>

N could be as small as 1 though support for smaller than i8 is<br>

optional. (no distinction is drawn between int and float in the vector<br>

configuration -- that's up to the operations performed)<br>

<br>

> We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions, allowing most future or current standard/non-standard extensions to be vectorized, rather than having to wait for additional extensions to have vector versions added to the V extension (one reason we are not using the V extension instead), such as the B extension.<br>

<br>

Do you mean instructions following the standard 48-bit encoding<br>

scheme, that happen to contain a standard 32 bit instruction as a<br>

payload?<br></blockquote><div>Yes. We reuse the 2 least-significant bits of the 32-bit instruction (since they are constant) to allow for more prefix bits. An example prefix scheme (that took the complexity waaay too far, we're working on that): <a href="https://salsa.debian.org/Kazan-team/kazan/blob/0c5abb5d35b03c52a21a54d4002f76bcec6c5d1d/docs/Prefix%20Proposal.md">https://salsa.debian.org/Kazan-team/kazan/blob/0c5abb5d35b03c52a21a54d4002f76bcec6c5d1d/docs/Prefix%20Proposal.md</a></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

>Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.<br>

<br>

No real difference. The standard RISC-V Vector extension is intended<br>

to allow exactly those changes to the vector configuration every few<br>

instructions. It's mostly the microcontroller people coming from<br>

DSP/SIMD who want to do that, so it's up to them to make that<br>

efficient on their cores -- they might even do macro-op fusion on it.<br></blockquote><div>Yeah, that works, but the extra configuration instructions need more instruction-fetch bandwidth.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Big OoO/Supercomputer style code compiled from C/FORTRAN in general<br>

doesn't want to do that kind of thing.<br></blockquote><div>We're aiming for SIMT-style code (Vulkan Shaders) converted into variable-length vector operations, so it's different from both the microcontroller and supercomputer styles.</div><div>Before vectorization, short vectors are used to represent:</div><div>- colors (RGBA)</div><div>- positions (XYZ)</div><div>- geometric vectors (XYZ)</div><div>- transformation matrices (4x4 or 4x3/3x4)</div><div>- positions in homogeneous coordinates (XYZW)<br></div><div>- and more.</div><div><br></div><div>The short vectors serve more as a grouping mechanism (like a struct or class) than as a way to improve performance.</div><div><br></div><div>One problem with the V extension in this use case is that 3-element vectors (pre-vectorization) are quite common. If each one has to be padded to a 4-element group, one lane in four is wasted. With a mechanism to natively support 3-element vectors, we could pack them tightly in registers and ALUs and avoid that 25% performance loss.</div><br>An example: <a href="http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html">http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html</a><br>Relevant section reproduced for convenience:<br>struct VertexIn<br>{<br>    vec3 position;<br>    vec3 normal;<br>    vec4 color; // rgba<br>};<br>struct VertexOut<br>{<br>    vec4 position; // xyzw<br>    vec4 color;<br>};<br>VertexIn vertexes_in[];<br>VertexOut vertexes_out[];<br>vec3 light_dir;<br>float ambient, diffuse;<br>for(int i = 0; i < 1000; i++)<br>{<br>    // calculate vertex colors using<br>    // Lambert's cosine model and fixed ambient brightness<br>    vec3 n = vertexes_in[i].normal;<br>    vec3 l = light_dir;<br>    float dot = n.x * l.x + n.y * l.y + n.z * l.z;<br>    float brightness = max(dot, 0.0) * diffuse + ambient;<br>    vec4 c = vertexes_in[i].color;<br>    c.rgb *= brightness;<br>    vertexes_out[i].color = c;<br>    // orthographic projection<br>    vertexes_out[i].position = 
vec4(vertexes_in[i].position, 1.0);<br>}<br><br>Vectorization produces:<br>for(int i = 0; i < 1000;)<br>{<br>    VL = setvl(1000 - i);<br>    vec3xVL n = load3xVL_strided(&vertexes_in[i].normal, sizeof(VertexIn));<br>    vec3 l = light_dir;<br>    vecVL dot = n.x * l.x + n.y * l.y + n.z * l.z;<br>    vecVL brightness = max(dot, 0.0) * diffuse + ambient;<br>    vec4xVL c = load4xVL_strided(&vertexes_in[i].color, sizeof(VertexIn));<br>    vec3xVL c_rgb = c.rgb;<br>    c_rgb *= brightness;<br>    c.rgb = c_rgb;<br>    store4xVL_strided(&vertexes_out[i].color, c, sizeof(VertexOut));<br>    vec4xVL p = 1.0;<br>    p.xyz = load3xVL_strided(&vertexes_in[i].position, sizeof(VertexIn));<br>    store4xVL_strided(&vertexes_out[i].position, p, sizeof(VertexOut));<br>    i += VL;<br>}<br><br>Jacob Lifshay</div></div></div></div>
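<div><br></div><div>For anyone who wants to run the strip-mined loop structure on an ordinary CPU, here is a minimal C sketch of just the brightness calculation. It is an illustration under stated assumptions, not the proposed hardware behaviour: MAX_VL, this setvl() and both brightness_* functions are hypothetical stand-ins introduced here, and the inner loops over k only emulate what a single vector instruction would do across VL elements.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Assumed maximum number of elements the "hardware" grants per strip.
   Hypothetical value chosen for illustration only. */
#define MAX_VL 8

typedef struct { float x, y, z; } vec3;

/* Hypothetical stand-in for setvl: grants min(remaining, MAX_VL). */
static size_t setvl(size_t remaining)
{
    return remaining < MAX_VL ? remaining : MAX_VL;
}

/* Scalar reference: Lambert brightness, one vertex per iteration. */
static void brightness_scalar(const vec3 *normals, vec3 l, float diffuse,
                              float ambient, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float dot = normals[i].x * l.x + normals[i].y * l.y
                  + normals[i].z * l.z;
        out[i] = (dot > 0.0f ? dot : 0.0f) * diffuse + ambient;
    }
}

/* Strip-mined version: each outer iteration handles VL vertices.
   Each inner loop models one vector operation over VL elements. */
static void brightness_stripmined(const vec3 *normals, vec3 l, float diffuse,
                                  float ambient, float *out, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = setvl(n - i);
        assert(vl > 0); /* forward progress: i < n implies vl >= 1 */

        float dot[MAX_VL];
        for (size_t k = 0; k < vl; k++) {
            const vec3 *nk = &normals[i + k];
            dot[k] = nk->x * l.x + nk->y * l.y + nk->z * l.z;
        }
        for (size_t k = 0; k < vl; k++) {
            out[i + k] = (dot[k] > 0.0f ? dot[k] : 0.0f) * diffuse + ambient;
        }
        i += vl;
    }
}
```

<div>Calling both functions on the same normals, light direction, diffuse and ambient values should fill the two output arrays with the same brightness per vertex, including when n is not a multiple of MAX_VL, because the last strip simply gets a shorter VL.</div>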