[PATCH] D58621: [XRay][tools] Pack XRayRecord - reduce memory footprint by a third. (RFC)
Roman Lebedev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Feb 25 07:21:46 PST 2019
lebedev.ri created this revision.
lebedev.ri added reviewers: dberris, kpw.
lebedev.ri added a project: LLVM.
Herald added subscribers: jdoerfert, courbet.
This is an RFC because of the `uint8_t CPU` change; that change needs discussing.
In "basic log mode", we indeed only ever read 8 bits into that field.
In FDR mode, however, the CPU field in the log is 16 bits wide.
But if you look at the compiler-rt part, as far as I can tell, the CPU id is always
(in both modes, basic and FDR) obtained from `uint64_t __xray::readTSC(uint8_t &CPU)`.
So the CPU id is effectively always only 8 bits, and in FDR mode the extra 8 bits are just padding.
Please don't take my word for it, do recheck!
=============================================
Thus, I do not believe we need a `uint16_t` for `CPU`: with the rest of the current code
we can never get more than a `uint8_t` value there, so we save 1 byte.
The rest of the patch is trivial.
By specifying the base type of `RecordTypes` we save 3 bytes.
`llvm::SmallVector<>`/`llvm::SmallString` cost only 16 bytes each, as opposed to 24/32 bytes.
Thus, in total, the old `sizeof(XRayRecord)` was 88 bytes, and the new one is 56 bytes.
There is no padding between the fields of `XRayRecord`, and `XRayRecord` itself isn't being
padded when stored into a vector. Thus the footprint of `XRayRecord` is now optimal.
This is important because `XRayRecord` has the biggest memory footprint,
and contributes most to the peak heap memory usage, at least for `llvm-xray convert`.
Some numbers:
`xray-log.llvm-exegesis.FswRtO` was acquired from an `llvm-exegesis`
(compiled with `-fxray-instruction-threshold=128`)
analysis-mode run over a `-benchmarks-file` with 10099 points (one full
latency measurement set), with a normal runtime of 0.387s.
Time old:
$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO
Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-old.yml xray-log.llvm-exegesis.FswRtO' (9 runs):
7607.69 msec task-clock # 0.999 CPUs utilized ( +- 0.48% )
522 context-switches # 68.635 M/sec ( +- 39.85% )
1 cpu-migrations # 0.073 M/sec ( +- 60.83% )
77905 page-faults # 10241.090 M/sec ( +- 3.13% )
30471867671 cycles # 4005708.241 GHz ( +- 0.48% ) (83.32%)
2424264020 stalled-cycles-frontend # 7.96% frontend cycles idle ( +- 1.84% ) (83.30%)
11097550400 stalled-cycles-backend # 36.42% backend cycles idle ( +- 0.35% ) (33.38%)
36899274774 instructions # 1.21 insn per cycle
# 0.30 stalled cycles per insn ( +- 0.07% ) (50.04%)
6538597488 branches # 859537529.125 M/sec ( +- 0.07% ) (66.70%)
79769896 branch-misses # 1.22% of all branches ( +- 0.67% ) (83.35%)
7.6143 +- 0.0371 seconds time elapsed ( +- 0.49% )
Time new:
$ perf stat -r9 ./bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO
Performance counter stats for './bin/llvm-xray convert -sort -symbolize -instr_map=./bin/llvm-exegesis -output-format=trace_event -output=/tmp/trace-new.yml xray-log.llvm-exegesis.FswRtO' (9 runs):
7207.49 msec task-clock # 1.000 CPUs utilized ( +- 0.46% )
174 context-switches # 24.159 M/sec ( +- 30.10% )
0 cpu-migrations # 0.062 M/sec ( +- 39.53% )
52126 page-faults # 7232.740 M/sec ( +- 0.69% )
28876446408 cycles # 4006783.905 GHz ( +- 0.46% ) (83.31%)
2352902586 stalled-cycles-frontend # 8.15% frontend cycles idle ( +- 2.08% ) (83.33%)
8986901047 stalled-cycles-backend # 31.12% backend cycles idle ( +- 1.00% ) (33.36%)
38630170181 instructions # 1.34 insn per cycle
# 0.23 stalled cycles per insn ( +- 0.04% ) (50.02%)
7016819734 branches # 973626739.925 M/sec ( +- 0.04% ) (66.68%)
86887572 branch-misses # 1.24% of all branches ( +- 0.39% ) (83.33%)
7.2099 +- 0.0330 seconds time elapsed ( +- 0.46% )
(Nice, runtime accidentally improved by ~5%.)
Memory old:
$ heaptrack_print heaptrack.llvm-xray.3976.gz | tail -n 7
total runtime: 18.16s.
bytes allocated in total (ignoring deallocations): 5.25GB (289.03MB/s)
calls to allocation functions: 21840309 (1202792/s)
temporary memory allocations: 228301 (12573/s)
peak heap memory consumption: 354.62MB
peak RSS (including heaptrack overhead): 4.30GB
total memory leaked: 87.42KB
Memory new:
$ heaptrack_print heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: 17.93s.
bytes allocated in total (ignoring deallocations): 5.05GB (281.73MB/s)
calls to allocation functions: 21840309 (1217747/s)
temporary memory allocations: 228301 (12729/s)
peak heap memory consumption: 267.77MB
peak RSS (including heaptrack overhead): 2.16GB
total memory leaked: 83.50KB
Memory diff:
$ heaptrack_print -d heaptrack.llvm-xray.3976.gz heaptrack.llvm-xray.5234.gz | tail -n 7
total runtime: -0.22s.
bytes allocated in total (ignoring deallocations): -195.36MB (876.07MB/s)
calls to allocation functions: 0 (0/s)
temporary memory allocations: 0 (0/s)
peak heap memory consumption: -86.86MB
peak RSS (including heaptrack overhead): 0B
total memory leaked: -3.92KB
So we did indeed reduce peak heap memory usage, by ~25%.
Not by a third, since something else is now the top contributor to the peak.
Repository:
rL LLVM
https://reviews.llvm.org/D58621
Files:
include/llvm/XRay/XRayRecord.h
include/llvm/XRay/YAMLXRayRecord.h
lib/XRay/Trace.cpp
tools/llvm-xray/xray-converter.cpp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D58621.188162.patch
Type: text/x-patch
Size: 4027 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20190225/cc5fde6e/attachment-0001.bin>