[PATCH] D58621: [XRay][tools] Pack XRayRecord - reduce memory footprint by a third. (RFC)

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 25 07:21:46 PST 2019


lebedev.ri created this revision.
lebedev.ri added reviewers: dberris, kpw.
lebedev.ri added a project: LLVM.
Herald added subscribers: jdoerfert, courbet.

This is an RFC because of the `uint8_t CPU` change.
That change needs discussing.

In "basic log mode", we indeed only ever read 8 bits into that field.
But in FDR mode, the CPU field in log is 16 bits.
But if you look in the compiler-rt part, as far as i can tell, the CPU id is always
(in both modes, basic and FDR) received from `uint64_t __xray::readTSC(uint8_t &CPU)`.
So naturally, CPU id is always only 8 bit, and in FDR mode, extra 8 bits is just padding.
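
For illustration, this is roughly the shape of that read on x86_64. This is a
hedged sketch using the `__rdtscp` intrinsic, not the verbatim compiler-rt code
(the actual implementation uses inline assembly and differs per architecture):

  #include <cstdint>
  #include <x86intrin.h>

  // Sketch only: RDTSCP reports the full 32-bit processor id via its
  // aux output, but the out-parameter in the XRay interface is uint8_t,
  // so only the low 8 bits of the CPU id ever reach the log.
  uint64_t readTSC(uint8_t &CPU) {
    unsigned int Aux;                 // full CPU id as written by RDTSCP
    uint64_t TSC = __rdtscp(&Aux);    // timestamp counter read
    CPU = static_cast<uint8_t>(Aux);  // truncated to 8 bits right here
    return TSC;
  }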

Please don't take my word for it, do recheck!
=============================================

Thus, I do not believe we need `uint16_t` for `CPU`: with the rest of the current code
we can never get more than a `uint8_t` value there, so we save 1 byte.

The rest of the patch is trivial.
By specifying the underlying type of the `RecordTypes` enum (`uint8_t` instead of the default `int`) we save 3 bytes.

`llvm::SmallVector<>`/`llvm::SmallString` only cost 16 bytes each, as opposed to 24/32 bytes for `std::vector<>`/`std::string`.

Thus, in total, the old `sizeof(XRayRecord)` was 88 bytes, and the new one is 56 bytes.
There is no padding between the fields of `XRayRecord`, and `XRayRecord` itself isn't
padded when stored into a vector. Thus the footprint of `XRayRecord` is now optimal.
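
To make the layout concrete, here is a minimal sketch of the packed struct.
The field set and enum values are approximated from `include/llvm/XRay/XRayRecord.h`,
not copied verbatim, so do recheck against the actual header:

  #include <cstdint>
  #include "llvm/ADT/SmallString.h"
  #include "llvm/ADT/SmallVector.h"

  // Approximate; the real enum lives in llvm/XRay/XRayRecord.h.
  enum class RecordTypes : uint8_t { ENTER, EXIT, TAIL_EXIT, ENTER_ARG };

  struct XRayRecord {
    uint16_t RecordType;                     // offset 0,  2 bytes
    uint8_t CPU;                             // offset 2,  1 byte (was uint16_t)
    RecordTypes Type;                        // offset 3,  1 byte (was 4, default int)
    int32_t FuncId;                          // offset 4,  4 bytes
    uint64_t TSC;                            // offset 8,  8 bytes
    uint32_t TId;                            // offset 16, 4 bytes
    uint32_t PId;                            // offset 20, 4 bytes
    llvm::SmallVector<uint64_t, 0> CallArgs; // offset 24, 16 bytes (was 24-byte std::vector)
    llvm::SmallString<0> Data;               // offset 40, 16 bytes (was 32-byte std::string)
  };

  // 2+1+1+4+8+4+4+16+16 = 56 bytes, with no padding holes on x86_64.
  static_assert(sizeof(XRayRecord) == 56, "XRayRecord should stay packed");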

This is important because `XRayRecord` has the biggest memory footprint,
and contributes the most to the peak heap memory usage, at least of `llvm-xray convert`.

Some numbers:

`xray-log.llvm-exegesis.FswRtO` was acquired from `llvm-exegesis`
(compiled with `-fxray-instruction-threshold=128`) running in
analysis mode over a `-benchmarks-file` with 10099 points (one full
latency measurement set); the normal runtime is 0.387s.

Time old:

  $ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO 
  
   Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO' (9 runs):
  
             7607.69 msec task-clock                #    0.999 CPUs utilized            ( +-  0.48% )
                 522      context-switches          #   68.635 M/sec                    ( +- 39.85% )
                   1      cpu-migrations            #    0.073 M/sec                    ( +- 60.83% )
               77905      page-faults               # 10241.090 M/sec                   ( +-  3.13% )
         30471867671      cycles                    # 4005708.241 GHz                   ( +-  0.48% )  (83.32%)
          2424264020      stalled-cycles-frontend   #    7.96% frontend cycles idle     ( +-  1.84% )  (83.30%)
         11097550400      stalled-cycles-backend    #   36.42% backend cycles idle      ( +-  0.35% )  (33.38%)
         36899274774      instructions              #    1.21  insn per cycle         
                                                    #    0.30  stalled cycles per insn  ( +-  0.07% )  (50.04%)
          6538597488      branches                  # 859537529.125 M/sec               ( +-  0.07% )  (66.70%)
            79769896      branch-misses             #    1.22% of all branches          ( +-  0.67% )  (83.35%)
  
              7.6143 +- 0.0371 seconds time elapsed  ( +-  0.49% )

Time new:

  $ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO 
  
   Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO' (9 runs):
  
             7207.49 msec task-clock                #    1.000 CPUs utilized            ( +-  0.46% )
                 174      context-switches          #   24.159 M/sec                    ( +- 30.10% )
                   0      cpu-migrations            #    0.062 M/sec                    ( +- 39.53% )
               52126      page-faults               # 7232.740 M/sec                    ( +-  0.69% )
         28876446408      cycles                    # 4006783.905 GHz                   ( +-  0.46% )  (83.31%)
          2352902586      stalled-cycles-frontend   #    8.15% frontend cycles idle     ( +-  2.08% )  (83.33%)
          8986901047      stalled-cycles-backend    #   31.12% backend cycles idle      ( +-  1.00% )  (33.36%)
         38630170181      instructions              #    1.34  insn per cycle         
                                                    #    0.23  stalled cycles per insn  ( +-  0.04% )  (50.02%)
          7016819734      branches                  # 973626739.925 M/sec               ( +-  0.04% )  (66.68%)
            86887572      branch-misses             #    1.24% of all branches          ( +-  0.39% )  (83.33%)
  
              7.2099 +- 0.0330 seconds time elapsed  ( +-  0.46% )

(Nice, runtime accidentally improved by ~5%.)

Memory old:

  $ heaptrack_print heaptrack.llvm-xray.3976.gz | tail -n 7
  total runtime: 18.16s.
  bytes allocated in total (ignoring deallocations): 5.25GB (289.03MB/s)
  calls to allocation functions: 21840309 (1202792/s)
  temporary memory allocations: 228301 (12573/s)
  peak heap memory consumption: 354.62MB
  peak RSS (including heaptrack overhead): 4.30GB
  total memory leaked: 87.42KB

Memory new:

  $ heaptrack_print heaptrack.llvm-xray.5234.gz | tail -n 7
  total runtime: 17.93s.
  bytes allocated in total (ignoring deallocations): 5.05GB (281.73MB/s)
  calls to allocation functions: 21840309 (1217747/s)
  temporary memory allocations: 228301 (12729/s)
  peak heap memory consumption: 267.77MB
  peak RSS (including heaptrack overhead): 2.16GB
  total memory leaked: 83.50KB

Memory diff:

  $ heaptrack_print -d heaptrack.llvm-xray.3976.gz heaptrack.llvm-xray.5234.gz | tail -n 7
  total runtime: -0.22s.
  bytes allocated in total (ignoring deallocations): -195.36MB (876.07MB/s)
  calls to allocation functions: 0 (0/s)
  temporary memory allocations: 0 (0/s)
  peak heap memory consumption: -86.86MB
  peak RSS (including heaptrack overhead): 0B
  total memory leaked: -3.92KB

So we did indeed improve (reduce) peak heap memory usage, by ~25%.
Not by a full third, since something else is now the top contributor to the peak.


Repository:
  rL LLVM

https://reviews.llvm.org/D58621

Files:
  include/llvm/XRay/XRayRecord.h
  include/llvm/XRay/YAMLXRayRecord.h
  lib/XRay/Trace.cpp
  tools/llvm-xray/xray-converter.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D58621.188162.patch
Type: text/x-patch
Size: 4027 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20190225/cc5fde6e/attachment-0001.bin>

