[cfe-dev] __sync_synchronize doesn't generate a memory barrier

Sun May 22 15:15:36 PDT 2011

On May 22, 2011, at 9:52 AM, Chris Lattner wrote:

> Adding Owen :)
> 
> -Chris
> 
> On May 9, 2011, at 10:38 AM, Arlen Cox wrote:
> 
>> Why does __sync_synchronize() not generate a mfence instruction on x86
>> and x86_64?  I recognize that Apple gcc does not do this either, but I
>> believe this is a bug in Apple gcc as well.  More recent versions of
>> gcc implement a correct behavior (mfence on x86_64 and lock orl $0,
>> (%esp) on x86), but clang emits no code for this operation.
>> 
>> LLVM supports an instruction that emits the correct memory barrier:
>> call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 true)
>> but Clang uses the following, which seems to have no effect on x86:
>> call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)
>> 
>> This matters for multi-threaded code as memory barriers are the only
>> way we can force an ordering on loads and stores.

If you have a standalone __sync_synchronize that's failing to generate an mfence, that is almost certainly a bug.  That said, there are a lot of circumstances where mfences aren't actually necessary.  x86 implements a very strong memory model ([1] and [2]), guaranteeing the following:

	• Loads are not reordered with other loads.
	• Stores are not reordered with other stores.
	• Stores are not reordered with older loads.
	• In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
	• In a multiprocessor system, stores to the same location have a total order.
	• In a multiprocessor system, locked instructions have a total order.
	• Loads and stores are not reordered with locked instructions.

Based on the last guarantee in that list, it is legal to eliminate mfences immediately preceding and immediately following locked instructions, typically in the context of mfence-atomic_op-mfence.  The compiler does this automatically, and that could be what is causing your missing mfence.
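
To make the mfence-atomic_op-mfence pattern concrete, here is a minimal sketch in C using the same __sync builtins discussed in this thread.  The counter and the function names are hypothetical, chosen only for illustration.  The comments describe the x86 reasoning given above, not a guarantee about what any particular compiler version emits.

```c
#include <stdint.h>

/* Hypothetical shared counter, for illustration only. */
static volatile int32_t counter = 0;

/* Full barrier, atomic read-modify-write, full barrier.
 * On x86 the atomic add is a locked instruction, and loads and stores
 * are not reordered with locked instructions, so the two surrounding
 * barriers add no further ordering and may legally be dropped. */
int32_t bump_with_fences(void) {
    __sync_synchronize();
    int32_t old = __sync_fetch_and_add(&counter, 1);
    __sync_synchronize();
    return old;   /* value of counter before the increment */
}

/* The same operation without explicit barriers.  Under the x86 rule
 * cited above, this is expected to give the same ordering. */
int32_t bump_plain(void) {
    return __sync_fetch_and_add(&counter, 1);
}
```

If the source looks like bump_with_fences, seeing no mfence in the generated assembly is expected rather than a bug.  A __sync_synchronize with no adjacent locked instruction is a different case.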

The only context where you really want to generate an mfence is where you need to prevent a load from being reordered ahead of an older store to a different address, which is the one reordering x86 still permits.  I'm sure there's some scenario where that reordering breaks sequential consistency, but I can't come up with one off the top of my head.
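
For what it's worth, the case usually cited for a full barrier on x86 is a Dekker-style handshake: each thread stores to its own flag and then loads the other thread's flag.  The sketch below assumes two threads, one running thread0 and the other thread1.  The variable and function names are hypothetical, and the code only shows where the barrier sits; it is not a complete locking algorithm.

```c
/* Hypothetical flags for a Dekker-style handshake; names are illustrative. */
static volatile int x = 0;
static volatile int y = 0;

/* Intended to run on one thread: publish x, then inspect y. */
int thread0(void) {
    x = 1;
    __sync_synchronize();   /* full barrier: keep the load of y after the store to x */
    return y;
}

/* Intended to run on another thread: publish y, then inspect x. */
int thread1(void) {
    y = 1;
    __sync_synchronize();   /* full barrier: keep the load of x after the store to y */
    return x;
}
```

With both barriers in place, the two functions cannot both return 0 when run concurrently.  Without them, each load may be satisfied before the preceding store becomes visible to the other processor, so both can return 0.  That outcome is impossible under sequential consistency.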

--Owen

[1] http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf
[2] http://support.amd.com/us/Processor_TechDocs/24593.pdf