[PATCH] D23933: [XRay] ARM 32-bit no-Thumb support in compiler-rt

Thu Sep 1 07:24:56 PDT 2016

rengolin added inline comments.

================
Comment at: lib/xray/xray_arm.cc:112
@@ +111,3 @@
+        reinterpret_cast<std::atomic<uint32_t> *>(FirstAddress), 
+        uint32_t(PatchOpcodes::PO_PushR0Lr), std::memory_order_release);
+  } else {
----------------
Right, it's a bit more complicated than that...

A good quick source of all factors: https://community.arm.com/groups/processors/blog/2011/03/22/memory-access-ordering--an-introduction

> But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.

So, you need to tell the other cores to wait until you write, then you need to store-release, then they can fetch. Otherwise, they'll fetch NOPs.

> Can it fetch instruction at pc+4 earlier than the instruction at pc?

In theory, no. In practice, maybe.

ARM has separate caches for code and data. If core0 reads 'pc' - 16, and the Icache line is, say, 32, then the NOPs are in core0's cache. Before core0 reaches the 'pc', core1 gets it, sets a load-acquire, and jumps to your thunk. At that time, you really want core0 to *stop* before reaching that specified 'pc', or it'll execute NOPs. Once core1 has written its shim, it then store-releases and core0 can continue, now executing your inserted code.

In summary, you *need* a barrier. Since this is about code fetching, you need an instruction barrier (ISB) not a data barrier (DMB).

> ARM CPU is always fetching the instruction at pc, decoding the instruction at pc-4 and executing the instruction at pc-8.

On the same core, instructions are (again, in theory) fetched and decoded "in order", but they're stored in a queue, which gets dispatched at any convenient time. So there is no concept of 'pc+8' at all. The cores will also speculatively fetch, decode and even execute (ex. branch prediction, peephole, etc).

So, there is absolutely *no* guarantee that any instruction will be fetched, decoded or executed before another, unless they have a strict dependency relationship, either by data dependency, atomic instructions or barriers.

================
Comment at: lib/xray/xray_inmemory_log.cc:188
@@ +187,3 @@
+#elif defined(__arm__)
+    // There is no instruction like RDTSCP in user mode on ARM. So we use
+    //   clock_gettime() which gives the result in nanoseconds. To get the
----------------
Ah, I see. I didn't know RDTSCP had a fixed frequency. In that case, a comment explaining it would be most welcome.

================
Comment at: lib/xray/xray_trampoline_arm.S:35
@@ +34,3 @@
+    POP {r1-r3,pc}
+
+    @ Word-aligned function entry point
----------------
Sorry, ignore me.

https://reviews.llvm.org/D23933