[LLVMdev] ASM output with JIT / codegen barriers

Mon Jan 4 18:43:27 PST 2010

On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
> Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
> I *only* care about atomicity w.r.t. a signal handler interruption in the
> same thread, *not* across threads. Therefore, many of the problems of
> cross-CPU atomicity are not relevant. The signal handler gets invoked via
> pthread_kill, and is thus necessarily running in the same thread as the code
> being interrupted. The memory in question can be considered thread-local
> here, so I'm not worried about other threads touching it at all.

Ok, this helps make sense, but it is still confusing to phrase this as
"single threaded". While the signal handler code may execute
exclusively of any other code, it does not share the stack frame, etc.
I'd describe this more as two threads of mutually exclusive execution
or some such. I'm not familiar with what synchronization occurs as
part of the interrupt process, but I'd verify it before making too
many assumptions.

> This sequence that SBCL does today with its internal codegen is basically
> like:
> MOV <pseudo_atomic>, 1
> [[do allocation, fill in object, etc]]
> XOR <pseudo_atomic>, 1
> JEQ continue
> <<call do_pending_interrupt>>
> continue:
> ...
>
> The important things here are:
> 1) Stores cannot be migrated from within the MOV/XOR instructions to outside
> by the codegen.

Basically, this is merely the problem that x86 places a stricter
requirement on memory ordering than LLVM does. Where x86 requires that
stores occur in program order, LLVM reserves the right to reorder them.
I have no idea if it is worthwhile to support memory barriers that apply
solely within a single flow of execution, but it seems highly suspicious.
On at least some non-x86 architectures, I suspect you'll need a memory
barrier here anyway, so it seems reasonable to place one regardless. I
*highly* doubt these fences are an overriding performance concern on
x86. Do you have any benchmarks that indicate they are?

> 2) There's no way an interruption can be missed: the XOR is atomic with
> regards to signals executing in the same thread, it's either fully executed
> or not (both load+store). But I don't care whether it's visible on other
> CPUs or not: it's a thread-local variable in any case.
>
> Those are the two properties I'd like to get from LLVM, without actually
> ever invoking superfluous processor synchronization.

Before we start extending LLVM to support expressing the finest points
of the x86 memory model in an optimal fashion given a single thread of
execution, I'd really need to see some compelling benchmarks showing
that the fences are a major performance problem. My understanding of the
implementation of these aspects of the x86 architecture is that they
shouldn't have a particularly high overhead.

>> The processor can reorder memory operations as well (within limits).
>> Consider that 'memset' to zero is often codegened to a non-temporal
>> store to memory. This exempts it from all ordering considerations
>
> My understanding is that processor reordering only affects what you might
> see from another CPU: the processor will undo speculatively executed
> operations if the sequence of instructions actually executed is not the
> sequence it predicted, so within a single CPU you should never be able tell
> the difference.
>
> But I must admit I don't know anything about non-temporal stores. Within a
> single thread, if I do a non-temporal store, followed by a load, am I not
> guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but
the Intel documentation specifically exempts these instructions from
the general rule that writes will not be reordered with other writes.
This means that a non-temporal store might be reordered to occur after
the "xor" to your atomic integer, even though the store instruction
comes before the xor in program order.
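
To illustrate the concern, here is a small C sketch that zeroes an object
with non-temporal stores and then clears a flag. The names region_flag and
zero_object_nt are made up for this example. On x86 with SSE2 it uses the
_mm_stream_si32 intrinsic (MOVNTI), followed by _mm_sfence, which is the
instruction that orders earlier non-temporal stores before later stores.
On other targets it falls back to plain stores. Whether the sfence is
actually needed when the only observer is a signal handler on the same CPU
is exactly the question in this thread, so treat this as a sketch of the
conservative choice rather than a recommendation.

```c
#include <signal.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32, _mm_sfence */
#endif

/* Hypothetical stand-in for the <pseudo_atomic> word in the quoted sequence. */
static volatile sig_atomic_t region_flag;

/* Zero n ints using non-temporal stores where available, then clear the flag. */
static void zero_object_nt(int *obj, size_t n) {
    region_flag = 1;

    for (size_t i = 0; i < n; i++) {
#if defined(__SSE2__)
        _mm_stream_si32(&obj[i], 0);  /* weakly ordered non-temporal store */
#else
        obj[i] = 0;                   /* fallback: ordinary store */
#endif
    }

#if defined(__SSE2__)
    /* Order the non-temporal stores above before the flag store below. */
    _mm_sfence();
#endif
    region_flag = 0;
}
```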

>
> James
>