[llvm-dev] [RFC] Vector Predication

Tue Feb 5 03:49:24 PST 2019

On Tuesday, February 5, 2019, Simon Moll <moll at cs.uni-saarland.de> wrote:

> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate

Mmm, that's very confusing to say that AVL is part of the predicate.
It's... kinda true?

>
>  and can change between vector operations just like a mask can (light
> weight).

Yes, ok, it's more that it is an "advisory". In RVV the program (the
instruction) *requests* a specific AVL and the processor responds with an
*actual* AVL of between 0 (yes really, zero) and MIN(MVL, requested_AVL).

To say that it's a predicate, well... a predicate mask, you set it, and the
mask is obeyed, period. AVL, that just doesn't happen.

>
> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).

My understanding of RVV is that MVL is intended to be more of a hardcoded
parameter that is part of the processor design. Any compiler should be
generating code that really does not need to know what MVL is.

SV is slightly different, due to the fact that we use the *scalar* regfile
as if it was a typecasted SRAM. The register number in any given
instruction is just a pointer to the SRAM address at which vector elements
i8/16/32/64 are read/written.

So in SV we need to *set* the MVL, otherwise how can the engine know the
point where it has to stop reading/writing to the register SRAM?
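
To make the "regfile as typecasted SRAM" idea concrete, here is a rough
sketch in C. It is only a model: the 32 x 64-bit register file, the RegFile
struct and the helper names are all assumptions made up for illustration,
not anything taken from the SV spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NUM_REGS = 32, REG_BYTES = 8 };   /* assumed sizes, illustration only */

typedef struct {
    uint8_t bytes[NUM_REGS * REG_BYTES]; /* scalar regfile viewed as flat SRAM */
    size_t  mvl;                         /* maximum vector length, in elements */
} RegFile;

/* Byte offset of element idx of a vector that starts at register reg.
   MVL is the only thing that tells the engine where to stop. */
static size_t elem_offset(const RegFile *rf, unsigned reg, size_t idx,
                          size_t elem_bytes) {
    assert(idx < rf->mvl);
    size_t off = (size_t)reg * REG_BYTES + idx * elem_bytes;
    assert(off + elem_bytes <= sizeof rf->bytes);
    return off;
}

/* Write one 16-bit element; the register number acts as a base pointer. */
static void write_i16(RegFile *rf, unsigned reg, size_t idx, int16_t v) {
    memcpy(&rf->bytes[elem_offset(rf, reg, idx, sizeof v)], &v, sizeof v);
}

/* Read one 16-bit element back from the same flat storage. */
static int16_t read_i16(const RegFile *rf, unsigned reg, size_t idx) {
    int16_t v;
    memcpy(&v, &rf->bytes[elem_offset(rf, reg, idx, sizeof v)], sizeof v);
    return v;
}
```

In this model a vector of i16 elements starting at register 3 runs straight
on into register 4 after four elements, which is why the engine needs MVL as
a hard stop.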

However what is most likely to happen is, MVL will be set globally to e.g 4
and be done with it.

SV semantics for AVL are also slightly different from RVV, not by much
though. The engine is not permitted to choose arbitrary values: if AVL is
requested to be set to 4, it must *be* set to MIN(MVL, 4).  This can
sometimes avoid the need for a loop, entirely (short vectors).
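
A minimal sketch of that rule in C, with the caveat that sv_set_vl and
add_short are hypothetical names invented here and the scalar for-loop merely
stands in for a single vector instruction:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the SV rule: the granted length is exactly
   MIN(MVL, requested), never some other value chosen by the engine. */
static size_t sv_set_vl(size_t mvl, size_t requested) {
    assert(mvl >= 1 && requested >= 1);  /* in SV neither may be zero */
    return requested < mvl ? requested : mvl;
}

/* Add two short vectors. Returns the number of elements processed; when
   that equals n, one pass was enough and no outer loop is needed. */
static size_t add_short(const int *a, const int *b, int *out,
                        size_t n, size_t mvl) {
    size_t vl = sv_set_vl(mvl, n);
    for (size_t i = 0; i < vl; i++) {    /* stands in for one vector op */
        out[i] = a[i] + b[i];
    }
    return vl;
}
```

With MVL set globally to 4 and a 4-element vector, add_short returns 4 and
the whole operation is done in one pass.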

Note also that in SV, neither AVL nor MVL may be set to zero. AVL=1
indicates that the engine is to interpret instructions in SCALAR mode.

> The vectorlen parameter in EVL intrinsics is for the AVL.
>
>
Ok so there is a bit of a problem, for both SV and RVV, in that both can
end up with different AVL values from what is requested.

If the API expects that, when AVL elements are requested, exactly that
number of elements *will* have been processed, that is simply not the
case, and that assumption will result in a catastrophic failure: elements
not being processed.

To deal with that, if it is a hard requirement of the API that exactly the
number of AVL ops are carried out as requested, an otherwise completely
redundant assembly code for-loop will have to be generated.

Oh and then outside of that loop would be the IR level inner loop that was
actually part of the user's program.

Basically what I am saying is that the semantics "request an AVL from the
hardware and get an ACTUAL number of elements to be processed" really needs
to become part of the API.
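
Roughly what I mean, as a sketch in C. request_vl is a hypothetical stand-in
for the "set vector length" instruction, and this toy version assumes the
hardware always grants at least one element when at least one is requested
(otherwise the loop could not make progress):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "request an AVL" primitive: ask for `requested` elements,
   get back the number actually granted. This toy model grants
   MIN(MVL, requested). */
static size_t request_vl(size_t requested, size_t mvl) {
    assert(mvl >= 1);
    return requested < mvl ? requested : mvl;
}

/* out[i] = a[i] + b[i] for n elements. The loop advances by the length
   the hardware actually granted, not by the length that was asked for.
   Returns the number of passes taken. */
static size_t vec_add(const int *a, const int *b, int *out,
                      size_t n, size_t mvl) {
    size_t done = 0;
    size_t passes = 0;
    while (done < n) {
        size_t vl = request_vl(n - done, mvl);   /* request the remainder */
        for (size_t i = 0; i < vl; i++) {        /* one vector op, vl wide */
            out[done + i] = a[done + i] + b[done + i];
        }
        done += vl;                              /* step by the ACTUAL length */
        passes++;
    }
    return passes;
}
```

The calling code never needs to know MVL: with 10 elements and an MVL of 4
this takes three passes (4, 4, 2), and the same code works unchanged for any
other MVL.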

Now, fascinatingly, for SIMD-only style architectures, that could
hypothetically be used to tell the JIT engine converting the IR to use
progressively smaller SIMD widths, on architectures that have multiple
widths, and also to indicate when corner-case cleanup is to be used.
(SIMD already being a mess, none of this would be high priority / optimised.)

OR...

the inner workings of AVL are entirely hidden and opaque to the IR. The IR
sets the total explicit number of elements, and It Gets Done.

However, I suspect that doing that will open a can of worms.

>>> I'm curious what SVE will do if there is an if/then/else in the middle
>>> of a vectorised loop with a shorter-than-maximum vector length. You
>>> can't just invert the mask when going from the then-part to the
>>> else-part because that would re-enable elements past the end of the
>>> vector. You'd need to invert the mask and then AND it with the mask
>>> containing the (bitwise representation of) the vector length.
>>>
>>
Yep, that is a workable solution for fixed width (SIMD) architectures, it
is a good pattern to use.
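
For the record, the pattern looks something like this in C, using a 32-bit
integer as the predicate (one bit per element). The names length_mask and
else_mask are made up for this sketch, and the 32-element limit is just an
artefact of the integer width chosen here:

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low vl bits set: one bit per active vector element. */
static uint32_t length_mask(unsigned vl) {
    assert(vl <= 32);
    return vl == 32 ? UINT32_MAX : ((UINT32_C(1) << vl) - 1u);
}

/* Predicate for the else-part: invert the then-mask, then AND with the
   length mask so that elements past the end of the vector stay off. */
static uint32_t else_mask(uint32_t then_mask, unsigned vl) {
    return ~then_mask & length_mask(vl);
}
```

With a vector length of 5 and a then-mask of 0b00101, a plain invert would
switch on bits 5 to 31 as well; ANDing with the length mask leaves 0b11010.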

As I mentioned earlier (about the mistake of using gather/scatter as a
means and method of implementing predication), it would be a mistake to try
to "dumb down" this proposal to cater for fixed-length SIMD engines to the
detriment of dynamic-length engines.

If you try that, then all the advantages of dynamic-length ISAs are utterly
destroyed, as the only way for a variable-length ISA to comply with a
dumbed-down fixed-length proposal is to issue brain-dead FIXED length
assembly code.

Whereas if the API can cope with variable length, the length that is
returned for a SIMD engine may be one of the multiples of SIMD widths that
that engine supports, and the engine can use scatter/gather as a substitute
for (potential) lack of predication masks, and so on.

If as an industry we want to break free of the seductively broken SIMD
paradigm, then variable-length engines need to be given top priority.

Really. And again, I say that with profuse apologies to all engineers who
have to deal with SIMD. I know it's so much easier to implement at the
hardware level, it's just that SIMD has always made the compiler writer's
job absolute hell.

L.

-- 
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68