[PATCH] D23933: [XRay] ARM 32-bit no-Thumb support in compiler-rt

Wed Aug 31 12:13:31 PDT 2016

rSerge marked 2 inline comments as done.

================
Comment at: lib/xray/xray_arm.cc:39
@@ +38,3 @@
+}
+
+// Writes the following instructions:
----------------
rengolin wrote:
> This is a new file, it should use LLVM's policy.
Changing.

================
Comment at: lib/xray/xray_arm.cc:51
@@ +50,3 @@
+  return Address + 1;
+}
+
----------------
rengolin wrote:
> Of course. Ignore me.
> 
> Though, this is the same as the one below, and you could merge them both by passing the register name and ORRing [reg << 12] with the instruction, and making sure reg < 15.
Ok, changing.

================
Comment at: lib/xray/xray_arm.cc:111
@@ +110,3 @@
+        uint32_t(PatchOpcodes::PO_B20), std::memory_order_release);
+  }
+  return true;
----------------
rengolin wrote:
> > Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?
> 
> I'm not sure what you mean. Many Cortex-AR cores are OOO. That's their design, you can't change that. Or maybe you mean "out of order amongst threads", which is not what I'm talking about.
> 
> Since this is in C++, so I'm guessing the compiler will "do the right thing" (tm) with regards to memory barriers, and the core being OOO makes no difference here.
> 
> Probably just a nomenclature clash around "OOO" between ourselves... :)
You are thinking about data: the CPU executes out of order the instructions which manipulate data. On the data side we only write, and `memory_order_release` should prevent reordering. But we write CPU instructions, which another core may be fetching, decoding and executing concurrently with our writes.
So I mean the scenario where the  CPU is reading instructions themselves from the code segment ("fetching") in order to then decode the instructions and finally execute. Can it fetch instruction at `pc+4` earlier than the instruction at `pc`?
As I understood from ARM specification, it can't: ARM CPU is always fetching the instruction at `pc`, decoding the instruction at `pc-4` and executing the instruction at `pc-8`.

================
Comment at: lib/xray/xray_inmemory_log.cc:188
@@ +187,3 @@
+#elif defined(__arm__)
+    CPUFrequency = NanosecondsPerSecond;
+#else
----------------
rengolin wrote:
> I still find this confusing... Is this 10^9 just a normalising factor, to get compatible numbers? If anything, this line needs a serious comment explaining why this is what it is.
> 
> Also, clock_gettime() will return a system wide, sequential and consistent number, while RDTSCP will return a counter that is internal to each CPU (and will be different across CPUs), thus prone to problems while context-switching.
> 
> Regardless, if you want CPU frequency, you can do exactly what you've done to x86.
10^9 is the number of nanoseconds per second. It can be viewed as a normalizing factor, to get measurements in seconds.
I would prefer something of higher resolution than clock_gettime() (they say on the internet that its resolution is only 1ms, while `RDTSCP` resolution is aroun 1ns), but I don't know how to do it on ARM. I searched on the internet and figured out that the cycle counter on ARM is 1) not available in user mode 2) changes frequency when CPU frequency changes. In contrast, RDTSCP on x86 is available in user mode and has a constant frequency, independent on CPU power-saving / turbo frequency adjustments.
I'm adding a comment. 

================
Comment at: lib/xray/xray_trampoline_arm.S:34
@@ +33,3 @@
+    VPOP {d0-d7}
+    POP {r1-r3,pc}
+
----------------
rengolin wrote:
> A8.8.132 POP (ARM):
> 
>     "ARM deprecates the use of this instruction with both the LR and the PC in the list."
The list contains only `pc`, not both `lr` and `pc`.

https://reviews.llvm.org/D23933