[LLVMdev] ASM output with JIT / codegen barriers

Mon Jan 4 13:13:40 PST 2010

On Jan 4, 2010, at 4:35 AM, Chandler Carruth wrote:

> Responding to the original email...
>
> On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals.
>
> Can you define exactly what 'atomic with regards to async signals'
> entails? Your descriptions led me to think you may mean something
> other than the POSIX definition, but maybe I'm just misinterpreting
> it. Are these signals guaranteed to run in the same thread? On the
> same processor? Is there concurrent code running in the address space
> when they run?

Hi, thanks everyone for all the comments. I think maybe I wasn't clear  
that I *only* care about atomicity w.r.t. a signal handler  
interruption in the same thread, *not* across threads. Therefore, many  
of the problems of cross-CPU atomicity are not relevant. The signal  
handler gets invoked via pthread_kill, and is thus necessarily running  
in the same thread as the code being interrupted. The memory in  
question can be considered thread-local here, so I'm not worried about  
other threads touching it at all.

I also realize I had (at least :) one error in my original email: of  
course, the atomic operations llvm provides *ARE* guaranteed to do the  
right thing w.r.t. atomicity against signal handlers...they in fact  
just do more than I need, not less. I'm not sure why I thought they  
were both more and less than I needed before, and sorry if it confused  
you about what I'm trying to accomplish.

Here's a concrete example, in hopes it will clarify matters:

@pseudo_atomic = thread_local global i64 0
declare i64* @alloc(i64)
declare void @do_pending_interrupt()
declare i64 @llvm.atomic.load.sub.i64.p0i64(i64* nocapture, i64) nounwind
declare void @llvm.memory.barrier(i1, i1, i1, i1, i1)

define i64* @foo() {
   ;; Note that we're in an allocation section
   store i64 1, i64* @pseudo_atomic
   ;; Barrier only to ensure instruction ordering, not needed as a
   ;; true memory barrier
   call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

   ;; Call might actually be inlined, so cannot depend upon unknown
   ;; call causing correct codegen effects.
   %obj = call i64* @alloc(i64 32)
   %obj_header = getelementptr i64* %obj, i64 0
   store i64 5, i64* %obj_header ;; store obj type (5) in header word
   %obj_len = getelementptr i64* %obj, i64 1
   store i64 2, i64* %obj_len ;; store obj length (2) in length slot
   ...etc...

   ;; Check if we were interrupted:
   %res = call i64 @llvm.atomic.load.sub.i64.p0i64(i64* @pseudo_atomic, i64 1)
   ;; atomic.load.sub yields the old value: 1 means no signal arrived
   %was_interrupted = icmp ne i64 %res, 1
   br i1 %was_interrupted, label %do-interruption, label %continue

continue:
   ret i64* %obj

do-interruption:
   call void @do_pending_interrupt()
   br label %continue
}

A signal handler will check the thread-local @pseudo_atomic variable:
if it is already set, the handler just changes the value to 2 and
returns, and is reinvoked by do_pending_interrupt at the end of the
pseudo-atomic section. The deferral is needed because the handler
could otherwise be confused by the half-built object in this code.
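
The handler side of this protocol can be sketched in C. This is only an
illustration under assumptions: the names on_signal, handle_interrupt_now
and demo_deferred_signal are hypothetical, SIGUSR1 is an arbitrary choice,
and raise() stands in for pthread_kill on the current thread. The exit
path is written as a plain C XOR, which is *not* guaranteed to be a single
instruction; it only mirrors the logic.

```c
#include <signal.h>

/* Hypothetical stand-in for @pseudo_atomic:
 * 0 = outside the section, 1 = inside, 2 = inside with a signal deferred. */
static volatile sig_atomic_t pseudo_atomic = 0;
/* Counts how often the real handler work ran (illustration only). */
static volatile sig_atomic_t handler_runs = 0;

/* Placeholder for whatever the runtime really does on an interrupt. */
static void handle_interrupt_now(void) { handler_runs++; }

static void on_signal(int signo) {
    (void)signo;
    if (pseudo_atomic != 0) {
        pseudo_atomic = 2;      /* inside the section: note it and defer */
        return;
    }
    handle_interrupt_now();     /* outside the section: handle directly */
}

/* Stand-in for do_pending_interrupt: leave the section, then run the
 * deferred work. */
static void do_pending_interrupt(void) {
    pseudo_atomic = 0;
    handle_interrupt_now();
}

/* Returns 1 if the signal was deferred and then handled exactly once. */
int demo_deferred_signal(void) {
    if (signal(SIGUSR1, on_signal) == SIG_ERR)
        return -1;

    pseudo_atomic = 1;          /* enter the pseudo-atomic section */
    raise(SIGUSR1);             /* handler runs in this thread before raise returns */
    int was_deferred = (handler_runs == 0) && (pseudo_atomic == 2);

    /* Leave the section. In C this is a separate load and store, so it only
     * shows the logic, not the single-instruction guarantee. */
    pseudo_atomic ^= 1;
    if (pseudo_atomic != 0)
        do_pending_interrupt();

    return was_deferred && handler_runs == 1 && pseudo_atomic == 0;
}
```

Calling demo_deferred_signal() once exercises the deferred path: the
handler sees the flag set, changes it to 2, and the real work runs only
after the section ends.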

The sequence SBCL emits today with its internal codegen is
basically:
MOV <pseudo_atomic>, 1
[[do allocation, fill in object, etc]]
XOR <pseudo_atomic>, 1
JE continue
<<call do_pending_interrupt>>
continue:
...

The important things here are:
1) Stores issued between the MOV and the XOR must not be moved outside
that region by the codegen.
2) An interruption can never be missed: the XOR is atomic with
regard to signals executing in the same thread, since a single
instruction either executes fully (both load and store) or not at
all. I don't care whether it's visible on other CPUs: it's a
thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without  
actually ever invoking superfluous processor synchronization.
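
For comparison, here is a rough sketch of the same two properties in C11
terms. It assumes that atomic_signal_fence, which orders operations only
between a thread and a signal handler run in that same thread, is an
acceptable stand-in for the ordering-only barrier. The names make_object,
simulated_handler, object_words and pending_runs are hypothetical, and
the "signal" is simulated by a plain function call. Note that
atomic_fetch_sub_explicit may still compile to a locked instruction on
x86, which is more synchronization than property 2 needs.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the thread-local @pseudo_atomic flag. */
static _Thread_local atomic_long pseudo_atomic;
/* Stand-in for the memory that alloc would return. */
static int64_t object_words[4];
/* Counts deferred interrupts that were run (illustration only). */
static int pending_runs;

static void do_pending_interrupt(void) {
    atomic_store_explicit(&pseudo_atomic, 0, memory_order_relaxed);
    pending_runs++;
}

/* What a handler would do if it fired inside the section. */
void simulated_handler(void) {
    if (atomic_load_explicit(&pseudo_atomic, memory_order_relaxed) != 0)
        atomic_store_explicit(&pseudo_atomic, 2, memory_order_relaxed);
}

/* midway, if non-NULL, is called inside the section to simulate a signal. */
int64_t *make_object(void (*midway)(void)) {
    atomic_store_explicit(&pseudo_atomic, 1, memory_order_relaxed);
    /* Property 1: keep the stores below from moving above the flag store.
     * This constrains the compiler only; no CPU fence is requested. */
    atomic_signal_fence(memory_order_seq_cst);

    object_words[0] = 5;        /* object type in header word */
    object_words[1] = 2;        /* object length */
    if (midway != NULL)
        midway();

    /* Keep the stores above from sinking below the flag update. */
    atomic_signal_fence(memory_order_seq_cst);
    /* Property 2: one indivisible read-modify-write; returns the old value. */
    long old = atomic_fetch_sub_explicit(&pseudo_atomic, 1,
                                         memory_order_relaxed);
    if (old != 1)               /* handler changed the flag: run it now */
        do_pending_interrupt();
    return object_words;
}
```

With midway set to NULL the section completes without running the pending
path; passing simulated_handler makes the old value 2, so
do_pending_interrupt runs once.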

> The processor can reorder memory operations as well (within limits).
> Consider that 'memset' to zero is often codegened to a non-temporal
> store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you  
might see from another CPU: the processor will undo speculatively  
executed operations if the sequence of instructions actually executed  
is not the sequence it predicted, so within a single CPU you should  
never be able to tell the difference.

But I must admit I don't know anything about non-temporal stores.  
Within a single thread, if I do a non-temporal store, followed by a  
load, am I not guaranteed to get back the value I stored?

James