[LLVMdev] ASM output with JIT / codegen barriers

Mon Jan 4 20:51:30 PST 2010

On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>> Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
>> I *only* care about atomicity w.r.t. a signal handler interruption in the
>> same thread, *not* across threads. Therefore, many of the problems of
>> cross-CPU atomicity are not relevant. The signal handler gets invoked via
>> pthread_kill, and is thus necessarily running in the same thread as the code
>> being interrupted. The memory in question can be considered thread-local
>> here, so I'm not worried about other threads touching it at all.
>
> Ok, this helps make sense, but it still is confusing to phrase this as
> "single threaded". While the signal handler code may execute
> exclusively to any other code, it does not share the stack frame, etc.
> I'd describe this more as two threads of mutually exclusive execution
> or some such.

I'm pretty sure James's way of describing it is accurate. It's a
single thread with an asynchronous signal, and C allows things in that
situation that it disallows for the multi-threaded case. In
particular, global objects of type "volatile sig_atomic_t" can be
shared between a signal handler and the main control flow of the
thread it interrupts, without locking. C++0x also defines an
atomic_signal_fence(memory_order) that only synchronizes with signal
handlers, in addition to the atomic_thread_fence(memory_order) that
synchronizes with other threads. See [atomics.fences] in the C++0x
draft.
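
To make the same-thread case concrete, here is a minimal sketch of a
handler publishing data to the code it interrupts using only
atomic_signal_fence. The names on_signal, payload, ready and
raise_and_read are made up for illustration, and I'm assuming
std::atomic<int> is lock-free on the target (the assert checks that).

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Plain data written by the handler and read by the interrupted thread.
static int payload = 0;
// Flag set by the handler; only relaxed accesses are used on it.
static std::atomic<int> ready(0);

// Hypothetical handler: publish the payload, then set the flag.
extern "C" void on_signal(int) {
    payload = 42;
    // Orders the payload store before the flag store as seen by the
    // interrupted thread. This constrains the compiler only.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

// Installs the handler, raises the signal in this thread, and reads the
// payload if the flag was set. Returns -1 if the handler did not run.
int raise_and_read() {
    assert(ready.is_lock_free());  // assumption stated in the lead-in
    std::signal(SIGINT, on_signal);
    std::raise(SIGINT);  // the handler runs in this thread before raise returns
    if (ready.load(std::memory_order_relaxed) == 1) {
        // Pairs with the release fence in the handler.
        std::atomic_signal_fence(std::memory_order_acquire);
        return payload;
    }
    return -1;
}
```

Swapping either fence for atomic_thread_fence would also be correct,
but it asks for cross-thread ordering that this situation doesn't need.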

> I'm not familiar with what synchronization occurs as
> part of the interrupt process, but I'd verify it before making too
> many assumptions.
>
>> This sequence that SBCL does today with its internal codegen is basically
>> like:
>> MOV <pseudo_atomic>, 1
>> [[do allocation, fill in object, etc]]
>> XOR <pseudo_atomic>, 1
>> JEQ continue
>> <<call do_pending_interrupt>>
>> continue:
>> ...
>>
>> The important things here are:
>> 1) Stores cannot be migrated from within the MOV/XOR instructions to outside
>> by the codegen.
>
> Basically, this is merely the problem that x86 places a stricter
> requirement on memory ordering than LLVM. Where x86 requires that
> stores occur in program order, LLVM reserves the right to change that.
> I have no idea if it is worthwhile to support memory barriers solely
> within the flow of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially
implement that with
  asm volatile("":::"memory")
but as James points out, that breaks the JIT, and will probably keep
doing so until llvm-mc is finished or someone implements a special
case for it.
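
As a rough sketch of what James's bracketed region could look like
with the C++0x fence in place of the inline asm: Cell, heap,
allocate_cell and in_pseudo_atomic are hypothetical names, and this
leaves out the pending-interrupt check that SBCL does with the XOR. It
only shows how the stores are pinned between the two flag writes.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical two-word object, standing in for whatever gets allocated.
struct Cell { long car; long cdr; };

// Nonzero while the thread is inside the pseudo-atomic region.
static volatile std::sig_atomic_t pseudo_atomic = 0;

static const int kHeapCells = 16;
static Cell heap[kHeapCells];
static int next_free = 0;

Cell* allocate_cell(long car, long cdr) {
    assert(next_free < kHeapCells);  // toy bump allocator, no GC
    pseudo_atomic = 1;
    // Asks the compiler not to move the stores below above the flag write.
    // On mainstream compilers this emits no fence instruction.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    Cell* cell = &heap[next_free++];
    cell->car = car;
    cell->cdr = cdr;
    // Asks the compiler not to sink the stores above past the flag clear.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    pseudo_atomic = 0;
    return cell;
}

// What a same-thread signal handler would consult before touching the heap.
bool in_pseudo_atomic() { return pseudo_atomic != 0; }
```

This is the piece that has to survive LLVM's optimizers; whether the
flag update itself ends up as a single instruction is a separate
question.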

> On at least some non-x86 architectures, I suspect you'll need a memory
> barrier here anyway, so it seems reasonable to place one. I
> *highly* doubt these fences are an overriding performance concern on
> x86, do you have any benchmarks that indicate they are?

On x86, a memory fence costs about as much as an atomic (locked)
operation, which is quite expensive, but you're right that benchmarks
are a good idea anyway.
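
A crude way to get a first number is to time a loop of stores followed
by each kind of fence. This is only a sketch: time_loop,
time_thread_fence and time_signal_fence are names I made up, and a
real benchmark would need warm-up, repetition, and a look at the
generated code to confirm what was actually emitted.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Volatile so the compiler has to perform every store in the loop.
static volatile int sink = 0;

// Times `iterations` rounds of "store, then fence" and returns nanoseconds.
template <typename Fence>
std::int64_t time_loop(int iterations, Fence fence) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = i;
        fence();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               stop - start).count();
}

// Full cross-thread fence after every store.
std::int64_t time_thread_fence(int iterations) {
    return time_loop(iterations, [] {
        std::atomic_thread_fence(std::memory_order_seq_cst);
    });
}

// Signal-only fence, which constrains the compiler but not the processor.
std::int64_t time_signal_fence(int iterations) {
    return time_loop(iterations, [] {
        std::atomic_signal_fence(std::memory_order_seq_cst);
    });
}
```

Comparing the two results for a large iteration count would show how
much of the cost comes from processor synchronization rather than from
the compiler ordering constraint.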

>> 2) There's no way an interruption can be missed: the XOR is atomic with
>> regards to signals executing in the same thread, it's either fully executed
>> or not (both load+store). But I don't care whether it's visible on other
>> CPUs or not: it's a thread-local variable in any case.
>>
>> Those are the two properties I'd like to get from LLVM, without actually
>> ever invoking superfluous processor synchronization.
>
> Before we start extending LLVM to support expressing the finest points
> of the x86 memory model in an optimal fashion given a single thread of
> execution, I'd really need to see some compelling benchmarks that it
> is a major performance problem. My understanding of the implementation
> of these aspects of the x86 architecture is that they shouldn't have a
> particularly high overhead.
>
>>> The processor can reorder memory operations as well (within limits).
>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>> store to memory. This exempts it from all ordering considerations.
>>
>> My understanding is that processor reordering only affects what you might
>> see from another CPU: the processor will undo speculatively executed
>> operations if the sequence of instructions actually executed is not the
>> sequence it predicted, so within a single CPU you should never be able to tell
>> the difference.
>>
>> But I must admit I don't know anything about non-temporal stores. Within a
>> single thread, if I do a non-temporal store, followed by a load, am I not
>> guaranteed to get back the value I stored?
>
> If you read the *same address*, then the ordering is guaranteed, but
> the Intel documentation specifically exempts these instructions from
> the general rule that writes will not be reordered with other writes.
> This means that a non-temporal store might be reordered to occur after
> the "xor" to your atomic integer, even if the instruction came prior
> to the xor.

It exempts these instructions from the cross-processor guarantees, but
I don't see anything saying that, for example, a temporal store in a
single processor's instruction stream after a non-temporal store may
be overwritten by the non-temporal store. Do you see something I'm
missing? If not, for single-thread signals, I think it's only compiler
reordering James has to worry about.