[PATCH] D23933: [XRay] ARM 32-bit no-Thumb support in compiler-rt

Tue Aug 30 13:18:07 PDT 2016

rSerge marked an inline comment as done.
rSerge added a comment.

I've responded inline and will upload a new diff in a minute.

> You might also consider extending the file header type to indicate what platform the trace was generated from, so tools can determine what to do with a trace that comes from a specific CPU. I'm happy to do it later, but wanted to know your thoughts on how we might encode that information appropriately.


I think we should look for some enumeration in compiler-rt or LLVM listing the CPU architectures, so to make XRay CPU codes consistent with that. But so far `XRayFileHeader` and `XRayRecord` don't seem to differ between CPUs, or do they?


================
Comment at: lib/sanitizer_common/scripts/gen_dynamic_list.py:54
@@ -52,3 +53,3 @@
   # On PowerPC, nm prints function descriptors from .data section.
-  if os.uname()[4] in ["powerpc", "ppc64"]:
+  if platform.uname()[4] in ["powerpc", "ppc64"]:
     func_symbols += ['D']
----------------
rengolin wrote:
> Have you tested this on Linux and Mac? To make sure it also work there?
I can test on Ubuntu. I don't have access to a Mac.

================
Comment at: lib/xray/xray_arm.cc:22
@@ +21,3 @@
+// basis. See xray_trampoline_*.s files for implementations.
+extern void __xray_FunctionEntry();
+extern void __xray_FunctionExit();
----------------
rengolin wrote:
> Can't you define this on a top-level header and implement on an arch-specific cpp file?
> 
> I don't think these things should be changing between arches.
Moving to xray_interface_internal.h .

================
Comment at: lib/xray/xray_arm.cc:29-32
@@ +28,6 @@
+// The machine codes for some instructions used in runtime patching.
+enum class PatchOpcodes : uint32_t
+{
+  PO_Push_r0_lr = 0xE92D4001,
+  PO_Blx_ip = 0xE12FFF3C,
+  PO_Pop_r0_lr = 0xE8BD4001,
----------------
dberris wrote:
> I think the coding standards say they should look like types, so just CamelCase. I don't make up the rules, but I just try to follow them. :)
Ok, changing to CamelCase.

================
Comment at: lib/xray/xray_arm.cc:38
@@ +37,3 @@
+// 0xUUUUWXYZ -> 0x000W0XYZ
+inline static uint32_t GetMovwMask(const uint32_t Value) {
+  return (Value & 0xfff) | ((Value & 0xf000) << 4);
----------------
rengolin wrote:
> dberris wrote:
> > Shouldn't these functions be camelCase? As in `getMoveMask(...)` according to the guide? I understand you're just following the conventions of the files around, and those mistakes are mine -- but do you mind changing them before landing?
> Yes, please. Let's not add different styles if we don't have to. Camel case, caps for variables, no caps for functions.
> 
> Enum values have format INI_Name, with "INI" the initials of the enum's name, all caps, and the Name a unique identifier within the enum.
The nearby code of sanitizers (ASAN, sanitizer_common) mostly names the functions starting with a capital letter. Do you still think I should name functions starting with a lowercase letter?

================
Comment at: lib/xray/xray_arm.cc:50
@@ +49,3 @@
+//   MOVT r0, #<higher 16 bits of the |Value|>
+inline static uint32_t *Write32bitLoadR0(uint32_t *Address,
+                                         const uint32_t Value) {
----------------
rengolin wrote:
> Why haven't use used inline assembly, here? This is really unreadable and error prone.
Because it is not possible. This code patches the user program at runtime with different instructions depending on the data in the user program. There doesn't seem anything we can put as inline assembly in compiler-rt code. It may be possible to use assembly strings, but that would require to link an assembler to the user program.

================
Comment at: lib/xray/xray_arm.cc:110
@@ +109,3 @@
+    CurAddress++;
+    *CurAddress = uint32_t(PatchOpcodes::PO_Pop_r0_lr);
+    std::atomic_store_explicit(
----------------
rengolin wrote:
> All modern ARM CPUs are multi-issue, out-of-order, so you cannot guarantee ordering without a data/memory barrier. ARMv8 has better atomic support.
I used `memory_order_release` here, on the writer side. According to C++11 standard, this should prevent reordering previous writes past this point (inserting fences if necessary, etc.).
There is little we can do on the reader side, as there is no data reader: it is the CPU fetching instructions on the other side. Is there any evidence that ARM may fetch instructions out of order? If so, how to prevent this?

================
Comment at: lib/xray/xray_inmemory_log.cc:30
@@ +29,3 @@
+#elif defined(__arm__)
+  static const int64_t cNanosecondsPerSecond = 1000LL*1000*1000;
+#else
----------------
rengolin wrote:
> dberris wrote:
> > `static constexpr` instead?
> > 
> > Also, please follow the naming conventions for this one too.
> > 
> > Another thing -- couldn't you just use std::chrono for the constant here?
> > 
> > http://en.cppreference.com/w/cpp/chrono/duration
> Better still, get the proper value?
> 
> I know it can be a tad different from hardware to hardware, but not even trying isn't really helpful.
What do you mean by getting the proper value?
I think that getting here anything other than just simple "1 billion" would be too unexpected, and we would need error checking for that. Furthermore, getting it from other compile units may result in initialization order issues. It is easier and more reliable to just have it as a 1 billion constant.

================
Comment at: lib/xray/xray_inmemory_log.cc:30
@@ +29,3 @@
+#elif defined(__arm__)
+  static const int64_t cNanosecondsPerSecond = 1000LL*1000*1000;
+#else
----------------
rSerge wrote:
> rengolin wrote:
> > dberris wrote:
> > > `static constexpr` instead?
> > > 
> > > Also, please follow the naming conventions for this one too.
> > > 
> > > Another thing -- couldn't you just use std::chrono for the constant here?
> > > 
> > > http://en.cppreference.com/w/cpp/chrono/duration
> > Better still, get the proper value?
> > 
> > I know it can be a tad different from hardware to hardware, but not even trying isn't really helpful.
> What do you mean by getting the proper value?
> I think that getting here anything other than just simple "1 billion" would be too unexpected, and we would need error checking for that. Furthermore, getting it from other compile units may result in initialization order issues. It is easier and more reliable to just have it as a 1 billion constant.
I'm renaming this to get rid of `c` prefix.
I think that pulling the whole `chrono` just for nanoseconds per second number may be a waste of compile time.

================
Comment at: lib/xray/xray_inmemory_log.cc:70
@@ -62,2 +69,3 @@
 
+#if !defined(__arm__)
 static std::pair<ssize_t, bool> retryingReadSome(int Fd, char *Begin,
----------------
rengolin wrote:
> I'd use #defined __x86_64__ instead, and replicate to all arches it's supposed to work later.
> 
> We don't want broken fall-back logic, as it's really hard to find bugs later.
Changing.

================
Comment at: lib/xray/xray_inmemory_log.cc:188
@@ +187,3 @@
+#elif defined(__arm__)
+    CPUFrequency = cNanosecondsPerSecond;
+#else
----------------
rengolin wrote:
> I don't understand this... Is this just hard-coding to 1GHz?
> 
>     /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> 
> Also works on most ARM and AArch64 boards.
No, this is not hard-coding to 1GHz. x86_64 uses `RDTSCP` instruction in the numerator, that is why the denominator is CPU frequency. There is nothing similar for ARM in user mode. So we fall back on `clock_gettime()` system call. It provides time in nanoseconds. That is why we use 1 billion as the denominator. Shall I rename the variable from `CPUFrequency` to something like `TicksPerSecond` so that it is more comprehensive on CPUs without instructions like `RDTSC`?

================
Comment at: lib/xray/xray_trampoline_arm.S:1
@@ +1,2 @@
+    .syntax unified
+	.code 32
----------------
rengolin wrote:
> You're going to need more than that.
> 
> Assemblers are very picky on what's valid, and Clang is specially so.
> 
> You'll need to put the minimum requirements on the header (cpu, fpu, arch, thumb-interop, etc).
> 
> You'll also need to put ".arch" to extend support on functions that use new instructions where the header flags don't support.
> 
> Example:
> 
>     .syntax unifixed
>     .arch   armv6t2
>     .fpu    vfpv2
>     
>     ...
>     
>     v7_only_func:
>     .fpu vfpv3
>     VMOV ...
> 
> This will mean you can use this on v6T2 onward, and that "v7_only_func" can only be used by arches with vfpv3, guaranteed by the dynamic dispatch. See libunwind and other compiler-RT ARM functions for this behaviour.
> 
I guess that the dynamic dispatch doesn't help us, because we are calling the function from machine code written at run-time into the code of user functions.
Adding `.arch armv7` and `.fpu vfpv3` .

================
Comment at: lib/xray/xray_trampoline_arm.S:6
@@ +5,3 @@
+    // It preserves all registers except r0, r12(ip), r14(lr) and r15(pc)
+    // Assume that "q" part of the floating-point registers is not used in C/C++
+	.global __xray_FunctionEntry // let C/C++ see the symbol
----------------
rengolin wrote:
> Why not? Vectorizers can and do use Q regs...
That can happen. But are `Q` registers used for passing parameters and returning values?
Perhaps my assembly comment is misleading: here (in `__xray_FunctionEntry`) we need to push&pop every register which may be used for passing parameters. And in `__xray_FunctionExit` we need to push&pop every register which may be used for returning values from C/C++ functions.

================
Comment at: lib/xray/xray_trampoline_arm.S:7
@@ +6,3 @@
+    // Assume that "q" part of the floating-point registers is not used in C/C++
+	.global __xray_FunctionEntry // let C/C++ see the symbol
+	.type __xray_FunctionEntry, %function
----------------
rengolin wrote:
> The comment character is "@" not "//". This will only work if compiled by a C++ compiler, not an assembler.
Changing.


https://reviews.llvm.org/D23933