[LLVMdev] Predicated Vector Operations

Mon May 13 08:20:38 PDT 2013

Chandler Carruth <chandlerc at google.com> writes:

>     There is no race *if* the hardware cache coherence says so. :)
>     There are false sharing issues here and different machines have
>     behaved very differently in the past.
>
> Let's not conflate races with false sharing. They're totally
> different, and false sharing is *not* what we're discussing here.

But in the real world false sharing exists and the compiler has to deal
with it.  We can say, "make codegen deal with it," but these issues
bubble up to the target-independent optimizer nonetheless.

A theoretical memory model is good to have but it's often not
sufficient.

>     The result entirely depends on the machine's consistency model.
>
>     LLVM is a virtual machine and the IR should define a consistency
>     model.  Everything flows from that. I think ideally we'd define
>     the model such that there is no race in the scalar code and the
>     compiler would be free to vectorize it. This is a very strict
>     consistency model and for targets with relaxed semantics, LLVM
>     would have to insert synchronization operations or choose not to
>     vectorize.
>
> LLVM already has a memory model. We don't need to add one. ;] It's
> here for reference: http://llvm.org/docs/LangRef.html#memmodel

I started to look at http://llvm.org/docs/Atomics.html first for a
gentler introduction and immediately spotted a problem.  Your first
example precluding register promotion for the update of x is hugely
pessimistic.  I don't particularly care because our optimizer has
already done the transformation before we hit LLVM.  :) But with that
restriction you're leaving a ton of performance on the table.
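
To make the register promotion point concrete, here is a minimal C++
sketch of the kind of loop and transformation at issue.  It is
paraphrased for illustration, not copied from the Atomics guide, and the
function names are made up.  The promoted form stores to x even when no
element of a is nonzero; that introduced store is what the memory model
rules out, because another thread might be writing x at the same time.

```cpp
#include <cassert>
#include <cstddef>

int x = 0;  // shared global, possibly visible to other threads

// Original form: x is stored only on iterations where a[i] is nonzero.
void count_nonzero(const int *a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (a[i])
      x += 1;
}

// Register-promoted form: x is loaded once, updated in a local, and
// stored once after the loop.  The final store happens unconditionally,
// even if the original loop would never have written x.
void count_nonzero_promoted(const int *a, std::size_t n) {
  int xtemp = x;
  for (std::size_t i = 0; i < n; ++i)
    if (a[i])
      xtemp += 1;
  x = xtemp;
}
```

In a single thread the two forms compute the same value of x; they
differ only in which stores are visible to other threads.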

The same goes for vector code generation, in general.  Our vectorizer
has already done it.  But let's get this right for everyone.

> The only thing that isn't in the model that is relevant here is
> something that isn't in LLVM today -- masked loads and stores. And
> that was what inspired my original question. =D

FWIW, informally, the Cray compiler ignores any concurrency it did not
itself create.  It won't generally introduce loads and stores that
weren't there, but it will certainly eliminate any loads and stores it
can.  We do have atomic operations which generally behave like the LLVM
atomics.  The memory model looks a lot like the C abstract machine.  We
generally give the compiler free rein.

We let the Cray compiler do some unsafe optimization from time to time.
Turning a masked load/operation/masked store into a full load/blend/full
store is a common case.  Users can disable it if they want to be extra
careful.  We worry about false sharing, but only after a certain point
in translation.  These have proven to be very practical and effective
techniques.
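
A scalar emulation of the two forms, as a sketch only: kLanes, Vec, Mask
and both function names are hypothetical and stand in for whatever
vector width and operations the target provides.  The masked form
touches memory only in active lanes.  The blended form reads and
rewrites every lane, so it can overwrite a value another thread stored
into an inactive lane between the full load and the full store.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed vector width
using Vec = std::array<int, kLanes>;
using Mask = std::array<bool, kLanes>;

// Masked load / add / masked store: inactive lanes are neither read
// nor written.
void add_masked(int *p, const Vec &v, const Mask &m) {
  for (std::size_t i = 0; i < kLanes; ++i)
    if (m[i])
      p[i] = p[i] + v[i];
}

// Full load / blend / full store: every lane is read and written, and
// inactive lanes get their old value written back.
void add_blend(int *p, const Vec &v, const Mask &m) {
  Vec old;
  for (std::size_t i = 0; i < kLanes; ++i)
    old[i] = p[i];                            // full load
  Vec res;
  for (std::size_t i = 0; i < kLanes; ++i)
    res[i] = m[i] ? old[i] + v[i] : old[i];   // blend
  for (std::size_t i = 0; i < kLanes; ++i)
    p[i] = res[i];                            // full store
}
```

With no other thread involved, both leave memory in the same state,
which is why the transformation is attractive and also why it is unsafe
in general.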

I wrote about masked stores vs. full stores in a previous message.  I
believe masked stores should write only to unmasked elements.  They
should not trap on masked elements.  If a developer needs something more
flexible for performance, he or she can do an unsafe transformation,
knowing the implications of doing so.
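
One place these semantics matter is the last, partial vector of a loop.
The sketch below emulates a masked store in scalar C++ and uses it to
fill an array whose length is not a multiple of the vector width.
kWidth, masked_store and fill are hypothetical names, not an existing
LLVM or Cray interface.  Lanes past the end of the array are masked off
and never dereferenced, so they can neither be modified nor fault.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 4;  // assumed vector width

// Emulated masked store: lane i writes dst[i] only if mask[i] is set.
// Masked-off lanes are never accessed.
void masked_store(int *dst, const int *src, const bool *mask) {
  for (std::size_t i = 0; i < kWidth; ++i)
    if (mask[i])
      dst[i] = src[i];
}

// Fill dst[0..n) with value, one vector-width chunk at a time.  The
// mask is all true except in the final chunk, where it is (base + i < n).
void fill(int *dst, std::size_t n, int value) {
  for (std::size_t base = 0; base < n; base += kWidth) {
    int src[kWidth];
    bool mask[kWidth];
    for (std::size_t i = 0; i < kWidth; ++i) {
      src[i] = value;
      mask[i] = base + i < n;
    }
    masked_store(dst + base, src, mask);
  }
}
```

A full store in the final chunk would instead write kWidth elements
starting at dst + base, running past dst[n - 1].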

                                  -David