[PATCH] D104060: Machine IR Profile

Fangrui Song via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Mon Jun 14 16:33:26 PDT 2021


MaskRay added a comment.

I have played with the patch.

`-fmachine-profile-generate` only inserts a `__llvm_mip_call_counts_caller` function call. There is no basic block instrumentation, so this is just the function entry count coverage mode.

`-fmachine-profile-generate -fmachine-profile-function-coverage` changes the `__llvm_mip_call_counts_caller` call to `movb $0, counter(%rip)`.
So this matches the traditional binary coverage mode.
This mode is supported by `clang -fsanitize-coverage=func,inline-bool-flag,pc-table`.
inline-bool-flag uses a conditional set because, IIUC, under concurrency this is faster than an unconditional racy write.
If needed `-fsanitize-coverage=inline-bool-flag` can introduce a mode to use a racy write.
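For illustration, a minimal sketch of the two flag-update styles (the names and the plain `bool` flag are mine, not the actual sancov runtime):

```c
#include <stdbool.h>

/* Hypothetical coverage flag for one function. */
static bool cov_flag;

/* inline-bool-flag style: test first, store only on the first hit.
   Once the flag is set, later calls only read the cache line, so it
   can stay shared across cores. */
static void mark_conditional(void) {
  if (!cov_flag)
    cov_flag = true;
}

/* Racy-write style: unconditional store on every call; simpler code,
   but keeps dirtying the cache line under concurrency. */
static void mark_racy(void) {
  cov_flag = true;
}
```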

`-fmachine-profile-generate -fmachine-profile-block-coverage` inserts `movb $0, counter(%rip)` for machine basic blocks.
This is a vertex profile (less powerful than an edge profile).
This mode is supported by `clang -fsanitize-coverage=edge,inline-bool-flag,pc-table`.
Mapping the information back to source files will require debug info (-g1).

Traditional gcc/clang coverage features (-fprofile-arcs/-fprofile-instr-generate/-fprofile-generate) are all about edge profiles and use word-size counters.
If the size is a concern, it is probably reasonable to use 32-bit counters, but smaller counters may not be suitable for PGO.

---

`__llvm_mip_call_counts_caller` is slow.
It is a function with a custom calling convention using RAX as the argument on x86-64.
This implementation-detail function saves and restores many vector registers.
I haven't studied why `__llvm_mip_call_counts_caller` is needed.

---

`__llvm_prf_data` (-fprofile-generate, -fprofile-instr-generate) vs `__llvm_mipmap` (-fmachine-profile-generate)

In the absence of value profiling, `__llvm_prf_data` uses:

  .quad NameRef
  .quad FuncHash
  .quad .L__profc_fun
  .quad fun   # need a dynamic relocation; used by raw profile reader
  .quad 0     # value profiling
  .long NumCounters
  .long 0     # value profiling, 2 unused value sites
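Laid out as a struct, the record above is roughly the following (field names are illustrative, not the exact LLVM ones; the authoritative layout lives in InstrProfData.inc):

```c
#include <stdint.h>

/* Rough sketch of one __llvm_prf_data record without value profiling.
   Field names are made up for readability. */
struct prf_data {
  uint64_t name_ref;            /* MD5 hash of the function name */
  uint64_t func_hash;           /* structural (CFG) hash */
  uint64_t counter_ptr;         /* address of .L__profc_fun */
  uint64_t func_ptr;            /* address of fun; needs a dynamic relocation */
  uint64_t values_ptr;          /* value profiling; 0 here */
  uint32_t num_counters;
  uint16_t num_value_sites[2];  /* value profiling; unused here */
};
/* 5*8 + 4 + 2*2 = 48 bytes per instrumented function. */
```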

If we want to save size for small linked images, we can change some `.quad` to `.long`.
E.g. if the number of functions is smaller than 2**16 (or slightly larger), we can use a 32-bit hash.
The `.L__profc_fun` reference can use `.long` if the offset cannot overflow 32 bits.

Note that 2 fields are only used by value profiling.

`__llvm_mipmap` has these fields. I added an inline comment noting that -shared doesn't link.

          .section        __llvm_mipmap,"aw",@progbits
          .globl  _Z3fooPiS_$MAP
          .p2align        3
  _Z3fooPiS_$MAP:
  .Lref2:
    ### not sure why this is needed
          .long   __start___llvm_mipraw-.Lref2    # Raw Section Start PC Offset
  
    ##### this does not link in -fpic -shared mode
          .long   _Z3fooPiS_$RAW-.Lref2           # Raw Profile Symbol PC Offset
  
          .long   _Z3fooPiS_-.Lref2               # Function PC Offset
          .long   .Lmip_func_end0-_Z3fooPiS_      # Function Size
          .long   0x0                             # CFG Signature
          .long   0                               # Non-entry Block Count
          .long   10                              # Function Name Length
          .ascii  "_Z3fooPiS_"



---

Some of my understanding about -fprofile-instr-generate vs -fprofile-generate

In clang you have full line/column/region information, so -fprofile-instr-generate works well for coverage.
The downside is that the frontend is not in a good position to apply various optimizations.

For instance, the important Kirchhoff's circuit law (aka spanning tree) optimization is not implemented. (I added the optimization to clang -fprofile-generate.)
So in bad cases (e.g. libvpx) -fprofile-instr-generate can be 15% slower than -fprofile-arcs/-fprofile-generate.
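To sketch the spanning tree idea on a made-up diamond CFG (entry branches to then/else, both fall through to exit): instrument only the edges off a chosen spanning tree and recover the rest by flow conservation at each block, which is the Kirchhoff analogy. The CFG and numbers here are mine, purely illustrative:

```c
#include <stdint.h>

/* Diamond CFG: entry -> {then, else} -> exit.
   Pick a spanning tree covering entry->then, then->exit and else->exit;
   then only entry->else needs a real counter.  Every other edge count
   follows from flow conservation (in-flow == out-flow at each block). */
struct derived { uint64_t entry_to_then, then_to_exit, else_to_exit; };

static struct derived derive(uint64_t entry_count, uint64_t entry_to_else) {
  struct derived d;
  d.entry_to_then = entry_count - entry_to_else; /* conservation at entry */
  d.then_to_exit  = d.entry_to_then;             /* 'then': one in, one out */
  d.else_to_exit  = entry_to_else;               /* likewise for 'else' */
  return d;
}
```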

The loop optimization (instead of incrementing a counter N times in a loop, add N to it once) cannot be enabled.
The benefit is relatively small, though.
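The loop optimization above, sketched by hand on a made-up counter (this is what the transformation produces conceptually, not actual compiler output):

```c
#include <stdint.h>

/* Naive instrumentation: one counter increment per loop iteration. */
static uint64_t count_naive(uint64_t n) {
  uint64_t counter = 0;
  for (uint64_t i = 0; i < n; i++)
    counter++;              /* executed n times */
  return counter;
}

/* After the optimization: the loop-invariant increments are folded
   into a single add of the trip count, with the same final value. */
static uint64_t count_promoted(uint64_t n) {
  uint64_t counter = 0;
  counter += n;             /* one add */
  return counter;
}
```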

The frontend cannot apply inlining or some early optimizations to greatly decrease the number of counters.

Instrumenting machine basic blocks feels awkward to me.
By that point much semantic information is lost. The loop optimization definitely cannot be applied.
Edge profiling is tricky: it requires splitting critical edges, and it is not clear how to do that after the machine basic block layout is finalized.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104060/new/

https://reviews.llvm.org/D104060
