[LLVMdev] ASM output with JIT / codegen barriers

Mon Jan 4 01:17:02 PST 2010

On Mon, Jan 4, 2010 at 12:20 AM, Owen Anderson <resistor at mac.com> wrote:
>
> On Jan 3, 2010, at 10:10 PM, James Y Knight wrote:
>
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals. So, for example, on x86, a single SUB on a memory location
>> should be used, not a load/sub/store sequence. LLVM's IR doesn't
>> currently have any way to express this kind of constraint (...and
>> really, that's essentially impossible since different architectures
>> have different possibilities, so I'm not asking for this...).
>
> Why do you want to do this?  As far as I'm aware, there's no guarantee that a memory-memory SUB will be observed atomically across all processors.  Remember that most processors are going to be breaking X86 instructions up into micro-ops, which might get reordered/interleaved in any number of different ways.

I'm assuming 'memory-memory' there is a typo, and we're just talking
about a 'sub' instruction with a memory destination. In that case,
I'll go further: the Intel IA-32 manual explicitly tells you that x86
processors are allowed to do the read and write halves of that single
instruction interleaved with other writes to that memory location from
other processors (see section 8.2.3.1 of [1]). =[ I can tell you from
bitter experience debugging code that assumed the unlocked instruction
was atomic: it does in fact happen. I have watched reference counters
miss both increments and decrements because of it, on both Intel and
AMD systems.
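
To make the failure mode concrete, here is a minimal C++ sketch (not
code from SBCL or LLVM; the function names are hypothetical). Portable
C++ cannot express an unlocked 'sub' on shared memory without a data
race, so split_decrement models the race with a separate atomic load
and store, while locked_decrement uses a single atomic
read-modify-write, which compilers typically lower to a LOCK-prefixed
instruction on x86.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: start `threads` workers, each calling `body` `iters` times.
template <typename Body>
static void run_workers(int threads, int iters, Body body) {
    assert(threads > 0 && iters >= 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([=] { for (int i = 0; i < iters; ++i) body(); });
    for (auto& th : pool) th.join();
}

// Separate load and store: each step is atomic, but the sequence is not,
// so another thread's update can land in between and be overwritten.
std::int64_t split_decrement(int threads, int iters) {
    std::atomic<std::int64_t> counter{static_cast<std::int64_t>(threads) * iters};
    run_workers(threads, iters, [&counter] {
        std::int64_t v = counter.load(std::memory_order_relaxed);
        counter.store(v - 1, std::memory_order_relaxed);
    });
    return counter.load();  // 0 only if no decrement was lost
}

// One atomic read-modify-write per decrement: no update can be lost.
std::int64_t locked_decrement(int threads, int iters) {
    std::atomic<std::int64_t> counter{static_cast<std::int64_t>(threads) * iters};
    run_workers(threads, iters, [&counter] {
        counter.fetch_sub(1, std::memory_order_relaxed);
    });
    return counter.load();  // always 0
}
```

locked_decrement(4, 50000) always returns 0. split_decrement(4, 50000)
returns however many decrements were lost, which on a multi-core
machine is usually well above zero but varies from run to run.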

>> All I really would like is to be able to specify the exact instruction
>> sequence to emit there. I'd hoped that inline asm would be the way to
>> do so, but LLVM doesn't appear to support asm output when using the
>> JIT compiler. Is there any hope for inline asm being supported with
>> the JIT anytime soon? Or is there an alternative suggested way of
>> doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
>> but that's both more expensive than I need as it has an unnecessary
>> LOCK prefix, and is also theoretically incorrect.

As I've mentioned above, I assure you the LOCK prefix matters. The
strange thing is that you think this is inefficient. Modern processors
don't lock the bus given this prefix to a 'sub' instruction; they just
lock the cache line and rely on the cache coherency protocol to resolve
the issue. This is much cheaper than, say, an 'xchg' instruction on an
x86 processor.
What is the performance problem you are actually trying to solve here?

> What they don't guarantee per the LangRef is sequential consistency.  If you care about that, you need to use explicit fencing.

Side note: I greatly regret that I didn't know enough about the
sequential consistency concerns here to address them more fully when I
was working on this. =/ Even explicit fencing, as currently specified,
has subtle problems. Is this causing problems for people (other
than jyasskin, who clued me in on the whole matter)?
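
For readers unfamiliar with why atomicity alone does not give
sequential consistency, the classic store-buffering test is a useful
illustration. The sketch below uses C++ atomics as an analogy, not the
LLVM intrinsics discussed in this thread, and count_both_zero is a
hypothetical name. Each thread writes one flag and then reads the
other. With only relaxed accesses, both threads may read 0; with a
sequentially consistent fence between the store and the load in each
thread, the C++ memory model rules that outcome out.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering test `rounds` times and returns how many runs
// ended with both threads reading 0 from the other thread's flag.
int count_both_zero(int rounds, bool use_fence) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();  // join() makes r1 and r2 safely readable here
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

count_both_zero(1000, true) is always 0. Without the fences the count
may or may not be nonzero on a given run, since thread start-up skew
hides most of the window, so the unfenced result only shows that the
reordering is permitted, not how often it happens.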