[cfe-commits] patch: libcxxabi cxa_*_virtual and cxa_guard_* methods

Mon May 23 16:11:28 PDT 2011

On May 23, 2011, at 7:13 AM, Howard Hinnant wrote:
> On May 23, 2011, at 3:22 AM, John McCall wrote:
>> 
>>> An assumption (and clang developers will have to tell me if it is correct or not) is that the compiler will do a proper double-checked locking dance prior to calling __cxa_guard_acquire.
>> 
>> I'm not sure what code you're imagining that the compiler might emit here;  if the compiler emitted a full and proper double-checked locking dance, it would not need to call __cxa_guard_acquire.  The compiler has the option of emitting an unprotected check that the first byte is non-zero, subject only to the restriction that accesses to the protected global cannot be re-ordered around that check.  It's not required to do any such thing, though.
> 
> Yes, that was what I was thinking, but also on some platforms (almost certainly not intel) an atomic read may be necessary.  I'm weak in this area though.  
> 
>> 
>> In short, for a multitude of reasons, __cxa_guard_acquire must be prepared for the possibility that the variable has already been initialized by the time it grabs the lock.
> 
> Oh, yes, absolutely.  The danger I'm worried about is the compiler getting a false '1' out of the first byte (using a non-atomic read) before the initializing thread is quite done.  I'm always confused as to how that might happen, but have been repeatedly warned by those more knowledgeable than me in this area, that it is possible on platforms with sufficiently weak memory ordering.

Yes.  Specifically, a processor has to be able to see writes made by a different processor out of order (writes to non-conflicting objects, obviously).  I'm going to break this down because it turns out to matter.

Imagine that Alice and Bob are racing to initialize a variable, and Bob happens to get there first.

Just in terms of writes, Bob is doing this:
  store x to _variable
  store 1 to _guard
Alice is doing this:
  tmp = load _guard
  goto skip if tmp != 0
  try to acquire guard, initialize, etc.
  y = load _variable
and we want to guarantee that x == y.  The way we prove this is to set things up such that if [store _guard < load _guard] (because the load read the value written by the store) then the processor's memory model guarantees that [store _variable < load _variable].

The reason that x86 doesn't require anything special here is that x86 promises not to "reorder" stores after stores or loads after loads;  that is, it guarantees that sequenced stores will become visible to other processors in the same order, and it guarantees that sequenced loads will see stores in an order consistent with the order they became visible.  So we automatically get that [store _variable < store _guard] and [load _guard < load _variable], and we don't need any fences.

However, the implications for hardware are that, e.g., a processor that needs to replace a dirty cache line must also publish every write it's made up to that point, and a processor that has a cache read miss has to make sure that the rest of its cache is up-to-date.  Those are pretty strong guarantees, so some architectures use weaker rules.  That means that, without barriers, Bob might publish his store to _guard before he publishes his store to _variable, or Alice might just use a cached version of _variable which doesn't include Bob's write.  So to make this work, Bob has to do a write barrier (effectively, publishing all his stores), and Alice has to do a read barrier (effectively, updating or invalidating her cache).  That gives us that both [store _variable < write barrier < store _guard] and [load _guard < read barrier < load _variable], which is what we need.
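As a rough sketch of that ordering, assuming C++11-style atomics are available: the names `variable`, `guard`, `bob`, `alice`, and `run_once` below are illustrative only and do not come from the patch. The release fence plays the role of Bob's write barrier and the acquire fence plays the role of Alice's read barrier; the slow path is left out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int variable = 0;            // stands in for _variable (a plain, non-atomic object)
std::atomic<int> guard{0};   // stands in for the first byte of _guard

// Bob: initialize, then publish.
void bob(int x) {
    variable = x;                                         // store x to _variable
    std::atomic_thread_fence(std::memory_order_release);  // write barrier
    guard.store(1, std::memory_order_relaxed);            // store 1 to _guard
}

// Alice: returns true and fills in y only if she saw the guard set.
bool alice(int& y) {
    int tmp = guard.load(std::memory_order_relaxed);      // tmp = load _guard
    if (tmp == 0)
        return false;  // slow path (acquire guard, initialize, ...) omitted here
    std::atomic_thread_fence(std::memory_order_acquire);  // read barrier
    y = variable;                                         // y = load _variable
    return true;
}

// Hypothetical driver: runs one race and returns the value Alice observed.
int run_once(int x) {
    variable = 0;
    guard.store(0);
    std::thread t(bob, x);
    int y = -1;
    while (!alice(y)) { }  // spin until Bob's store to guard becomes visible
    t.join();
    return y;
}
```

With both fences in place, Alice seeing `guard == 1` implies she also sees Bob's store to `variable`, so `run_once(x)` returns `x`. Dropping either fence removes that guarantee on weakly ordered hardware, even though the code would likely still appear to work on x86.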

So what the compiler has to do is to make sure that it emits a read barrier along the fast path, or else we might get a stale read;  but __cxa_guard_release also needs to perform a write barrier.  There are barriers happening as part of the mutex logic, but conventionally, acquiring a mutex only performs a read barrier and releasing it performs a write barrier, and the mutex release happens *after* the write to guard.  So I think you either need an extra write barrier before the store to guard, or you need to move guard out of the mutex (with all the attendant complexity).
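To make the first option concrete, here is a minimal sketch, again assuming C++11-style atomics and a single global mutex. `Guard`, `guard_mutex`, `guard_acquire`, `guard_release`, and `get_static` are hypothetical stand-ins, not the libcxxabi functions or the code in the patch; exception handling (the abort path) and recursive initialization are deliberately ignored.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical guard object: one byte that the fast path inspects.
struct Guard {
    std::atomic<unsigned char> initialized{0};
};

std::mutex guard_mutex;  // assumption: one global mutex protects all guards

// Returns 1 if the caller must run the initializer (the mutex stays held),
// or 0 if another thread finished initialization before we got the lock.
int guard_acquire(Guard& g) {
    guard_mutex.lock();
    if (g.initialized.load(std::memory_order_relaxed) != 0) {
        guard_mutex.unlock();  // already initialized by the time we locked
        return 0;
    }
    return 1;
}

// The release store is the extra write barrier *before* the store to guard;
// the mutex release alone would come too late to order it.
void guard_release(Guard& g) {
    g.initialized.store(1, std::memory_order_release);
    guard_mutex.unlock();
}

// Roughly what the compiler-emitted code would need to do at each use.
template <class Init>
int& get_static(Guard& g, int& storage, Init init) {
    // Fast path: the acquire load is the read barrier.
    if (g.initialized.load(std::memory_order_acquire) == 0) {
        if (guard_acquire(g)) {
            storage = init();  // if init throws, the mutex is left held (not handled)
            guard_release(g);
        }
    }
    return storage;
}
```

The re-check under the lock in `guard_acquire` covers the case where the variable has already been initialized by the time the lock is taken; the release/acquire pair on `initialized` covers the unlocked fast path.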

John.
