[llvm-dev] [RFC] Vector Predication

Mon Feb 4 16:54:37 PST 2019

with apologies for breaking the thread: i wasn't cc'd earlier in the
conversation.
http://lists.llvm.org/pipermail/llvm-dev/2019-January/129806.html

david, you wrote:

> I'm solidly of the opinion that we already *have* IR support for
> explicit masking in the form of gather/scatter/etc...  Until someone has
> taken the effort to make masking in this context *actually work well*,
> I'm unconvinced that we should greatly expand the usage in the IR.

the problem with gather/scatter is that it requires moving the data,
either register-to-register (MV) or via memory (LD/ST).

MV - particularly with quite large data sets - puts pressure on a
microarchitecture to increase the size of the register file (otherwise
data has to be pushed to stack).

LD/ST - as shown by Jeff Bush in his work on nyuzi - results in
*significant* power consumption increases due to having to push data
through the L1/L2 cache (which is all CAMs).

in SV we are deliberately dropping the vectorisation onto the
*standard* register file *precisely* to avoid the need to exchange
data between a special vector register file and a scalar register
file.

additionally, the microarchitecture being designed actually happens to
effectively implement (use) gather/scatter techniques when a predicate
mask is used.  this is done by pushing element operations into a
multi-issue instruction queue and simply skipping the non-predicated
elements [thus we get 100% ALU utilisation even when there are
back-to-back "if then else" inverted predicate masks: the non-inverted
predicate issues one set of elements, and the inverted predicate
issues exactly the complementary set].
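
to illustrate the utilisation point, here is a minimal python sketch (a
software model only, not the actual SV microarchitecture).  the function
name issue_predicated and the sample data are hypothetical.  it assumes
one ALU slot is consumed per *issued* element and that masked-out
elements consume none.

```python
def issue_predicated(op, src, dest, mask):
    """Issue one element operation per active mask bit.

    Masked-out elements are skipped entirely, so they use no ALU slot.
    Returns the number of element operations actually issued.
    """
    issued = 0
    for i, active in enumerate(mask):
        if active:
            dest[i] = op(src[i])
            issued += 1
    return issued


# hypothetical data: "if x > 0 then x * 2 else -x"
src = [1, -2, 3, -4, 5, -6, 7, -8]
mask = [x > 0 for x in src]        # the "then" predicate
inverted = [not m for m in mask]   # the "else" predicate
dest = [0] * len(src)

n_then = issue_predicated(lambda x: x * 2, src, dest, mask)
n_else = issue_predicated(lambda x: -x, src, dest, inverted)
```

under those assumptions the two back-to-back passes together issue
exactly len(src) element operations, so no slot is spent on a
masked-out element.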

basically i feel that this is the right paradigm.

now, if a given ISA doesn't *have* predicate masks, then yes,
absolutely, gather/scatter at the *instruction* level (as opposed to
the micro-architectural level) is the correct way to *emulate*
predication.  instructions may be issued that exclude the
non-predicated elements, put them into a group (even a SIMD
fixed-width group), and re-extract them on the other side of the
group-operation into the required destination registers.
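
a rough python sketch of that emulation, assuming a hypothetical
fixed-width, unmasked SIMD operation (simd_op, here just "add 100") and
a hypothetical group width of 4.  none of these names come from any real
ISA or from LLVM; they only model the gather / operate / scatter steps
described above.

```python
def simd_op(group):
    """Hypothetical unmasked fixed-width SIMD operation: add 100 per lane."""
    return [x + 100 for x in group]


def emulate_predicated(src, dest, mask, width=4):
    """Emulate a predicated operation on an ISA with no predicate masks."""
    # indices of the elements the predicate selects
    active = [i for i, m in enumerate(mask) if m]
    for start in range(0, len(active), width):
        idx = active[start:start + width]
        # gather: pack only the selected elements into one group
        group = [src[i] for i in idx]
        # pad a short final group up to the fixed SIMD width
        group += [0] * (width - len(group))
        result = simd_op(group)
        # scatter: write results back to their original positions,
        # leaving unselected destination elements untouched
        for lane, i in enumerate(idx):
            dest[i] = result[lane]
    return dest


src = [10, 20, 30, 40, 50, 60]
mask = [True, False, True, True, False, True]
out = emulate_predicated(src, [-1] * len(src), mask)
```

here the elements at indices 1 and 4 are never touched, which is the
behaviour a real predicate mask would give directly.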

even the previously-mentioned SX-Aurora architecture (and other SIMD
architectures) could use this trick, to effectively "emulate"
predication where the ISA doesn't have predicate masks, and it can
also be used to emulate variable-length vectors, simply by
setting the top elements of a SIMD block to zero (or ignoring them
entirely) and only copying out the lower-indexed elements with a
scatter operation.  whilst that is not particularly efficient, that's
not LLVM's problem: SIMD architectures were designed the way they are
because it's seductively simpler at the hardware level.
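
the variable-length trick can be sketched the same way.  this is an
illustrative python model only: WIDTH, simd_add and vl_add are
hypothetical names, and the fixed width of 8 is an assumption, not a
property of SX-Aurora or any other real architecture.

```python
WIDTH = 8  # assumed fixed SIMD width (hypothetical)


def simd_add(a, b):
    """Hypothetical fixed-width SIMD add: always processes WIDTH lanes."""
    assert len(a) == len(b) == WIDTH
    return [x + y for x, y in zip(a, b)]


def vl_add(a, b, dest, vl):
    """Emulate a vector add of length vl (vl <= WIDTH) on fixed-width SIMD."""
    assert 0 <= vl <= WIDTH
    # zero the top (unused) lanes of each SIMD block
    pa = a[:vl] + [0] * (WIDTH - vl)
    pb = b[:vl] + [0] * (WIDTH - vl)
    full = simd_add(pa, pb)
    # copy out only the lower-indexed vl results; the rest are ignored
    for i in range(vl):
        dest[i] = full[i]
    return dest


res = vl_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [-1] * 5, vl=3)
```

all WIDTH lanes are computed every time, even when vl is small, which
is where the inefficiency comes from.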

however, expecting an architecture that *does* support proper
predication to complexify the way it does predication, by
shoe-horning it into gather/scatter, is sub-optimal, and i'm
drawing a mental blank as to how it could be done, let alone done
effectively and efficiently.

that's not to say that gather/scatter should be removed entirely: that
would be a mistake.  there are circumstances where gather/scatter is
far better suited for use than predicate masks.

bottom line: i feel that expecting predication to be implemented in
terms of gather/scatter is the wrong way round.  the IR should have
explicit and proper support for predicate masks, and architectures
that don't *have* predicate masks should *use* gather/scatter
instructions to emulate them.

l.