[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
Duncan P. N. Exon Smith
dexonsmith at apple.com
Tue Oct 14 11:40:02 PDT 2014
> On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Stupid question, but when I was working on LTO last Summer the primary culprit for excessive memory use was due to us not being smart when linking the IR together (Espindola would know more details). Do we still have that problem? For starters, how does the memory usage of just llvm-link compare to the memory usage of the actual LTO run? If the issue I was seeing last Summer is still there, you should see that the invocation of llvm-link is actually the most memory-intensive part of the LTO step, by far.
To be clear, I'm running this command line:
$ llvm-lto -exported-symbol _main llvm-lto.lto.bc
Since this is a pre-linked bitcode file, we shouldn't be wasting much
memory from the linking stage.
Running ld64 directly gives a peak memory footprint of ~30GB for the
full link, so there's something else going on there that I'll be
digging into later.
> 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we have a single pie slice near 40% of the # of Value's allocated and another at 21%. Especially this being "step 4".
15.3GB is the peak memory of `llvm-lto`. This comes late in the
process, after DIEs have been created. I haven't looked in detail past
debug info metadata, but here's a sketch of what I imagine is in memory
at this point.
- The IR, including uniquing side-tables.
- Optimization and backend passes.
- Parts of SelectionDAG that haven't been freed.
- `MachineFunction`s and everything inside them.
- Whatever state the `AsmPrinter`, etc., need.
I expect to look at a couple of other debug-info-related memory usage
areas once I've shrunk the metadata:
- What's the total footprint of DIEs? This run has 4M of them, whose
allocated footprint is ~1GB. I'm hoping that a deeper look will
reveal an even larger attack surface.
- How much do debug info intrinsics cost? They show up in at least
three forms -- IR-level, SDNodes, and MachineInstrs -- and there
can be a lot of them. How many? What's their footprint?
For now, I'm focusing on the problem I've already identified.
> You need more data. Right now you have essentially one data point,
I looked at a number of internal C and C++ programs with -flto -g, and
dug deeply into llvm-lto.lto.bc because it's small enough that it's easy
to analyze (and its runtime profile was representative of the other C++
programs I was looking at).
I didn't look deeply at a broad spectrum, but memory usage and runtime
for building clang with -flto -g is something we care a fair bit about.
> and it's not even clear what you measured really. If your goal is saving memory, I would expect at least a pie chart that breaks down LLVM's memory usage (not just # of allocations of different sorts; an approximation is fine, as long as you explain how you arrived at it and in what sense it approximates the true number).
I'm not sure there's value in diving deeply into everything at once.
I've identified one of the bottlenecks, so I'd like to improve it before
digging into the others.
Here's some visibility into where my numbers come from.
I got the 15.3GB from a profile of memory usage vs. time. Peak usage
comes late in the process, around when DIEs are being dealt with.
Metadata node counts stabilize much earlier in the process. The rest of
the numbers are based on counting `MDNode`s and their respective
`MDNodeOperand`s, and multiplying each count by the size of the
corresponding type. Here's a dump from around the peak metadata node
count:
LineTables = 7500000, InlinedLineTables = 6756182, Directives = 7611669, Arrays = 570609, Others = 1176556
Tag = 256, Count = 554992, Ops = 2531428, Name = DW_TAG_auto_variable
Tag = 16647, Count = 988, Ops = 4940, Name = DW_TAG_GNU_template_parameter_pack
Tag = 52, Count = 9933, Ops = 59598, Name = DW_TAG_variable
Tag = 33, Count = 190, Ops = 190, Name = DW_TAG_subrange_type
Tag = 59, Count = 1, Ops = 3, Name = DW_TAG_unspecified_type
Tag = 40, Count = 24731, Ops = 24731, Name = DW_TAG_enumerator
Tag = 21, Count = 354166, Ops = 2833328, Name = DW_TAG_subroutine_type
Tag = 2, Count = 77999, Ops = 623992, Name = DW_TAG_class_type
Tag = 47, Count = 27122, Ops = 108488, Name = DW_TAG_template_type_parameter
Tag = 28, Count = 8491, Ops = 33964, Name = DW_TAG_inheritance
Tag = 66, Count = 10930, Ops = 43720, Name = DW_TAG_rvalue_reference_type
Tag = 16, Count = 54680, Ops = 218720, Name = DW_TAG_reference_type
Tag = 23, Count = 624, Ops = 4992, Name = DW_TAG_union_type
Tag = 4, Count = 5344, Ops = 42752, Name = DW_TAG_enumeration_type
Tag = 11, Count = 360390, Ops = 1081170, Name = DW_TAG_lexical_block
Tag = 258, Count = 1, Ops = 1, Name = DW_TAG_expression
Tag = 13, Count = 73880, Ops = 299110, Name = DW_TAG_member
Tag = 58, Count = 1387, Ops = 4161, Name = DW_TAG_imported_module
Tag = 1, Count = 2747, Ops = 21976, Name = DW_TAG_array_type
Tag = 46, Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram
Tag = 257, Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable
Tag = 8, Count = 2246, Ops = 6738, Name = DW_TAG_imported_declaration
Tag = 53, Count = 57, Ops = 228, Name = DW_TAG_volatile_type
Tag = 15, Count = 55163, Ops = 220652, Name = DW_TAG_pointer_type
Tag = 41, Count = 3382, Ops = 6764, Name = DW_TAG_file_type
Tag = 22, Count = 158479, Ops = 633916, Name = DW_TAG_typedef
Tag = 48, Count = 486, Ops = 2430, Name = DW_TAG_template_value_parameter
Tag = 36, Count = 15, Ops = 45, Name = DW_TAG_base_type
Tag = 17, Count = 1164, Ops = 8148, Name = DW_TAG_compile_unit
Tag = 31, Count = 19, Ops = 95, Name = DW_TAG_ptr_to_member_type
Tag = 57, Count = 2034, Ops = 6102, Name = DW_TAG_namespace
Tag = 38, Count = 32133, Ops = 128532, Name = DW_TAG_const_type
Tag = 19, Count = 72995, Ops = 583960, Name = DW_TAG_structure_type
(Note: the InlinedLineTables stat is included in LineTables stat.)
You can determine the rough memory footprint of each type of node by
multiplying the "Count" by `sizeof(MDNode)` (x86-64: 56B) and the "Ops"
by `sizeof(MDNodeOperand)` (x86-64: 32B).
Overall, there are 7.5M line table entries with 30M operands, so by this
method their footprint is ~1.3GB. There are 7.6M descriptors with 42.4M
operands, so their footprint is ~1.7GB.
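As a rough sketch of that arithmetic (not code from the profiling run; the function and constant names are made up for illustration, and the sizes are the x86-64 figures quoted above):

```cpp
#include <cassert>
#include <cstdint>

// Sizes quoted in the text for x86-64; treat them as assumptions that
// only hold for the LLVM revision being measured.
constexpr std::uint64_t kSizeofMDNode = 56;
constexpr std::uint64_t kSizeofMDNodeOperand = 32;

// Rough footprint in bytes for one class of metadata node:
//   Count * sizeof(MDNode) + Ops * sizeof(MDNodeOperand)
// (hypothetical helper, not an LLVM API).
constexpr std::uint64_t estimateFootprintBytes(std::uint64_t Count,
                                               std::uint64_t Ops) {
  return Count * kSizeofMDNode + Ops * kSizeofMDNodeOperand;
}

// Line table entries: 7.5M nodes with 30M operands.
constexpr std::uint64_t kLineTableBytes =
    estimateFootprintBytes(7500000, 30000000);
// Descriptors: 7.6M nodes with 42.4M operands.
constexpr std::uint64_t kDescriptorBytes =
    estimateFootprintBytes(7600000, 42400000);

// Usage example: the two totals behind the ~1.3GB and ~1.7GB estimates.
inline void checkEstimates() {
  assert(kLineTableBytes == 1380000000ULL);  // 420MB nodes + 960MB operands
  assert(kDescriptorBytes == 1782400000ULL); // ~426MB nodes + ~1357MB operands
}
```

The same helper can be applied to any row of the dump by passing that row's "Count" and "Ops".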
I dumped another stat periodically to tell me the peak size of the
side-tables for line table entries, which are split into "Scopes" (for
non-inlined) and "Inlined" (these counts are disjoint, unlike the
LineTables and InlinedLineTables stats above):
Scopes = 203166, Inlined = 3500000
I assumed that both `DenseMap` and `std::vector` over-allocate by 50%
to estimate the current (and planned) costs for the side-tables.
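A minimal sketch of that over-allocation assumption follows. The helper name is hypothetical, and the per-entry size is deliberately left as a parameter because the actual entry sizes are not given here:

```cpp
#include <cassert>
#include <cstdint>

// Estimated bytes held by a side-table, assuming the container
// (`DenseMap` or `std::vector`) over-allocates by 50%.
// BytesPerEntry is an input the caller must supply; no real value is
// implied by this sketch.
constexpr std::uint64_t estimateSideTableBytes(std::uint64_t Entries,
                                               std::uint64_t BytesPerEntry) {
  return Entries * BytesPerEntry * 3 / 2;
}

// Usage example with made-up numbers: 2 entries of 8 bytes each occupy
// 16 bytes, plus 50% slack.
inline void checkSideTableEstimate() {
  assert(estimateSideTableBytes(2, 8) == 24);
}
```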
Another stat I dumped periodically was the breakdown between V(alues),
U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings).
Here's a sample from nearby:
V = 23967800 (40200000 - 16232200)
U =  5850877 ( 7365503 -  1514626)
C =   205491 (  279134 -    73643)
M = 16837368 (31009291 - 14171923)
S =   693869 (  693869 -        0)
Lastly, I dumped a breakdown of the types of MDNodeOperands. This is
also a sample from nearby:
MDOps = 77644750 (100%)
Const = 14947077 ( 19%)
Node  = 41749475 ( 53%)
Str   =  9553581 ( 12%)
Null  = 10976693 ( 14%)
Other =   417924 (  0%)
While I didn't use this breakdown for my memory estimates, it was
interesting nevertheless. Note the following:
- The number of constants is just under 15M. This dump came less than
a second before the dump above, where we have 7.5M line table
entries. Line table entries have 2 operands of `ConstantInt`. This
lines up nicely.
Note: this checked `isa<Constant>(Op) && !isa<GlobalValue>(Op)`.
- There are a lot of null operands. By making subclasses for the
various types of debug info IR, we can probably shed some of these.
- There are few "Other" operands. These are likely all `GlobalValue`
references, and are the only operands that need to be referenced
using value handles.
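
The consistency checks in the notes above can be reproduced with a small sketch. The struct and helper names are hypothetical, and the truncating integer percentage is an assumption that happens to match the figures in the dump:

```cpp
#include <cassert>
#include <cstdint>

// Operand counts copied from the MDNodeOperand dump.
struct OperandCounts {
  std::uint64_t Const, Node, Str, Null, Other;
};

constexpr OperandCounts kSample = {14947077, 41749475, 9553581, 10976693,
                                   417924};

constexpr std::uint64_t total(const OperandCounts &C) {
  return C.Const + C.Node + C.Str + C.Null + C.Other;
}

// Integer percentage, truncated toward zero (assumed rounding mode).
inline std::uint64_t percentOf(std::uint64_t Part, std::uint64_t Whole) {
  assert(Whole != 0 && "cannot take a percentage of zero");
  return Part * 100 / Whole;
}

// Each line table entry carries 2 `ConstantInt` operands.
constexpr std::uint64_t expectedLineTableConstants(std::uint64_t Entries) {
  return 2 * Entries;
}

// Usage example: the five categories add up to the MDOps total, and the
// constant count sits just under 2 x 7.5M line table entries.
inline void checkBreakdown() {
  assert(total(kSample) == 77644750ULL);
  assert(kSample.Const < expectedLineTableConstants(7500000));
}
```

The small shortfall against 15M is expected, since the two dumps were taken at slightly different times.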