[LLVMdev] ASM output with JIT / codegen barriers

Tue Jan 5 05:32:06 PST 2010

On Tue, Jan 5, 2010 at 12:09 AM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 4, 2010 at 8:51 PM, Jeffrey Yasskin <jyasskin at google.com> wrote:
>> On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>>>> The important things here are:
>>>> 1) Stores cannot be migrated from within the MOV/XOR instructions to outside
>>>> by the codegen.
>>>
>>> Basically, this is merely the problem that x86 places a stricter
>>> requirement on memory ordering than LLVM. Where x86 requires that
>>> stores occur in program order, LLVM reserves the right to change that.
>>> I have no idea if it is worthwhile to support memory barriers solely
>>> within the flow of execution, but it seems highly suspicious.
>>
>> It's needed to support std::atomic_signal_fence. gcc will initially
>> implement that with
>>  asm volatile("":::"memory")
>> but as James points out, that kills the JIT, and probably will keep
>> doing so until llvm-mc is finished or someone implements a special
>> case for it.
>
> Want to propose an extension to the current atomics of LLVM? Could we
> potentially clarify your previous concern regarding the pairing of
> barriers to operations, as it seems like they would involve related
> bits of the lang ref? Happy to work with you on that sometime this Q
> if you're interested; I'll certainly have more time. =]

I have some ideas for that, and will be happy to help.
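
For reference, here is a rough sketch of the kind of code
std::atomic_signal_fence is meant to support. It is only an
illustration: the names (payload, ready, observed, on_signal,
publish_and_raise) are made up for this example and do not come from
James's code. The fence constrains compiler reordering only and
typically emits no instruction, which is the same job the
asm volatile("":::"memory") idiom does in gcc.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

// Hypothetical names, for illustration only.
static int payload = 0;                 // plain data written before the flag
static std::atomic<int> ready{0};       // flag the handler checks
static std::atomic<int> observed{-1};   // what the handler saw

extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Keep the compiler from hoisting the payload read above the flag read.
        std::atomic_signal_fence(std::memory_order_acquire);
        observed.store(payload, std::memory_order_relaxed);
    }
}

int publish_and_raise(int value) {
    // Signal handlers should only touch lock-free atomics.
    assert(ready.is_lock_free() && observed.is_lock_free());
    std::signal(SIGINT, on_signal);

    payload = value;
    // Compiler-only barrier: the store to payload may not be moved
    // below the store to ready. No hardware fence is requested.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);

    std::raise(SIGINT);            // handler runs in this thread before raise returns
    std::signal(SIGINT, SIG_DFL);  // restore the default disposition
    return observed.load(std::memory_order_relaxed);
}
```

Calling publish_and_raise(42) should hand 42 to the handler, because
the handler runs on the same thread and only compiler reordering could
get in the way.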

>>>>> The processor can reorder memory operations as well (within limits).
>>>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>>>> store to memory. This exempts it from all ordering considerations
>>>>
>>>> My understanding is that processor reordering only affects what you might
>>>> see from another CPU: the processor will undo speculatively executed
>>>> operations if the sequence of instructions actually executed is not the
>>>> sequence it predicted, so within a single CPU you should never be able to tell
>>>> the difference.
>>>>
>>>> But I must admit I don't know anything about non-temporal stores. Within a
>>>> single thread, if I do a non-temporal store, followed by a load, am I not
>>>> guaranteed to get back the value I stored?
>>>
>>> If you read the *same address*, then the ordering is guaranteed, but
>>> the Intel documentation specifically exempts these instructions from
>>> the general rule that writes will not be reordered with other writes.
>>> This means that a non-temporal store might be reordered to occur after
>>> the "xor" to your atomic integer, even if the instruction came prior
>>> to the xor.
>>
>> It exempts these instructions from the cross-processor guarantees, but
>> I don't see anything saying that, for example, a temporal store in a
>> single processor's instruction stream after a non-temporal store may
>> be overwritten by the non-temporal store. Do you see something I'm
>> missing? If not, for single-thread signals, I think it's only compiler
>> reordering James has to worry about.
>
> The exemption I'm referring to (Section 8.2.2 of System Programming
> Guide from Intel) is to the write-write ordering of the
> *single-processor* model. Reading the referenced section on the
> non-temporal behavior for these instructions (10.4.6 of volume 1 of
> the architecture manual) doesn't entirely clarify the matter for me
> either. It specifically says that the non-temporal writes may occur
> outside of program order, but doesn't seem to clarify exactly what the
> result of overlapping temporal writes is without fences within the
> same program thread. The only examples I'm finding are for
> multiprocessor scenarios. =/

Yeah, it's not 100% clear. I'm pretty sure that x86 maintains the
fiction of a linear "instruction stream" within each processor, even
in the presence of interrupts (which underlie pthread_kill and OS-level
thread switching). For example, in 6.6, we have "The ability of a P6
family processor to speculatively execute instructions does not affect
the taking of interrupts by the processor. Interrupts are taken at
instruction boundaries located during the retirement phase of
instruction execution; so they are always taken in the “in-order”
instruction stream."

But I'm not an expert in non-temporal anything.
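
To make the two cases concrete, here is a small sketch. It assumes the
SSE2 intrinsics from <emmintrin.h> and falls back to plain stores where
SSE2 is not available; the function names are hypothetical. The first
function is the uncontroversial case from above (same thread, same
address). The second shows the usual way to order a non-temporal store
before a later ordinary store as seen by other processors, which, as
far as I understand the Intel manuals, needs an SFENCE. It does not
settle the open question about overlapping temporal writes.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32, _mm_sfence
#endif

// Hypothetical helper: non-temporal store, then a load of the *same
// address* from the same thread. The load sees the stored value.
int stream_then_load(int value) {
    alignas(64) static int slot = 0;
#if defined(__SSE2__)
    _mm_stream_si32(&slot, value);  // non-temporal store (MOVNTI)
#else
    slot = value;                   // fallback: ordinary store
#endif
    return slot;
}

// Hypothetical helper: non-temporal store to *data, then an ordinary
// store to a flag at a different address. The SFENCE orders the
// non-temporal store before the flag store for other processors.
void stream_then_publish(int* data, std::atomic<int>& flag, int value) {
    assert(data != nullptr);
#if defined(__SSE2__)
    _mm_stream_si32(data, value);
    _mm_sfence();
#else
    *data = value;
#endif
    flag.store(1, std::memory_order_release);
}
```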