[llvm-dev] [RFC] Vector Predication

Fri Feb 1 01:18:49 PST 2019

On Thu, Jan 31, 2019 at 11:53 PM Luke Kenneth Casson Leighton via
llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> ---
> crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
>
> On Thu, Jan 31, 2019 at 10:22 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> >
> > We're in the process of designing a RISC-V extension (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html) that would have variable-length vectors of short vectors (1 to 4 elements each):
> > <VL x <4 x float>>
> > where each predicate bit masks out a whole short vector. We're using this extension to vectorize graphics code where variables in the pre-vectorization code are short vectors.
> > So, vectorizing code like:
> > for(int i = 0; i < 1000; i++)
> > {
> >     vec4 color = colors[i];
> >     vec3 normal = normals[i];
> >     color.rgb *= fmax(0.0, dot(normal, light_dir));
> >     colors[i] = color;
> > }
> >
> > I'm planning on passing already vectorized code into LLVM and using LLVM as a backend for optimization and JIT code generation.
> >
> > Do you think the EVL proposal would support an ISA like this as it's currently
> > written (by pattern matching on predicate expansion and vector-length
> > multiplication)?
>
> whilst it may be tempting to suggest that a solution is to multiply up
> the bits in the predicate (into groups of 3 or 4), the problem with
> that is that if there are operations that require vec3 or vec4 as
> operands interspersed with predicated operations that do not, that
> realistically implies a need for two separate predicate registers,
> otherwise cycles are wasted swapping predicates OR it implies that the
> architecture *allows* two separate predicate registers to be selected.
>
>  consequently, it would be much, much better to be able to have a
> single bit of a predicate apply to the *entire* vec3 or vec4 type, on
> each outer loop.

This situation can be handled easily in the standard RISC-V vector
extension. You'd do something like...

vsetvli t0, a0, vsew128,vnreg8,vdiv4

... to configure the vector unit to provide eight vector register
variables with a standard element width of 128 bits (some
instructions will widen or narrow one step to/from 64 bits or 256
bits), with each 128-bit element then divided into 4 parts.

Arithmetic/logical/shift will happen on 32 bit elements, but
predication and loads and stores (including strided or scatter/gather)
will operate on 128 bit elements.
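
To make the split between the two granularities concrete, here is a rough scalar model in C. It is only a sketch: the names vec4 and masked_scale are made up for illustration, and leaving masked-off elements untouched in the destination is an assumption on my part, not something the text above specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit element split into four 32-bit parts. */
typedef struct { float lane[4]; } vec4;

/* One predicate entry per 128-bit element; the multiply itself runs on
   each 32-bit lane. Assumption: a masked-off element leaves dst as-is. */
void masked_scale(vec4 *dst, const vec4 *src, const float *scale,
                  const uint8_t *pred, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        if (!pred[i])
            continue;                      /* whole vec4 is skipped */
        for (size_t j = 0; j < 4; j++)     /* 32-bit arithmetic */
            dst[i].lane[j] = src[i].lane[j] * scale[i];
    }
}
```

The point is that the predicate array has vl entries, not 4*vl, so nothing has to be multiplied up or swapped when vec4 operations are mixed with the rest of the loop.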

[I just made up "vnreg8" as an alias for the standard "vlmul4" because
"vlmul4,vdiv4" might look confusing. Either way it means putting 0b10
into bits [1:0] of the vtype CSR, specifying that the 32 vector
registers should be ganged into 8 groups, each 4x longer than standard,
because (I'm assuming) we need more than four vector registers in this
loop, but no more than eight.]
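
The arithmetic behind that note can be sketched in C as follows. The only data point taken from the text is vlmul4 -> 0b10 and 32 registers -> 8 groups; treating the field as log2 of the grouping factor for the other values is an assumption, and vlmul_field and register_groups are made-up helper names.

```c
/* Assumption: the two-bit field in vtype[1:0] holds log2 of the
   grouping factor, which matches vlmul4 -> 0b10 from the note above. */
unsigned vlmul_field(unsigned lmul)
{
    unsigned bits = 0;
    while (lmul > 1) {      /* count how many times lmul halves to 1 */
        lmul >>= 1;
        bits++;
    }
    return bits & 0x3u;     /* keep only bits [1:0] */
}

/* 32 architectural vector registers ganged lmul at a time. */
unsigned register_groups(unsigned lmul)
{
    return 32u / lmul;
}
```

With lmul = 4 this gives a field value of 2 (0b10) and 8 register groups, which is the configuration requested above.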