[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR

Eric Christopher echristo at gmail.com
Wed Oct 15 14:34:24 PDT 2014


On Wed, Oct 15, 2014 at 2:32 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
> On Wed, Oct 15, 2014 at 2:31 PM, Eric Christopher <echristo at gmail.com> wrote:
>>
>> On Wed, Oct 15, 2014 at 2:30 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> >
>> >
>> > On Mon, Oct 13, 2014 at 7:01 PM, Eric Christopher <echristo at gmail.com> wrote:
>> >>
>> >> On Mon, Oct 13, 2014 at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> >> > For those interested, I've attached some pie charts based on Duncan's
>> >> > data in one of the other posts; successive slides break down the usage
>> >> > increasingly finely. To my understanding, they represent the number of
>> >> > Values (and subclasses) allocated.
>> >> >
>> >> > On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>> >> >>
>> >> >> In r219010, I merged integer and string fields into a single header
>> >> >> field.  By reducing the number of metadata operands used in debug
>> >> >> info, this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some
>> >> >> profiling of DW_TAGs to see what parts of PR17891 and PR17892 to
>> >> >> tackle next, and I've concluded that they will be insufficient.
>> >> >>
>> >> >> Instead, I'd like to implement a more aggressive plan, which as a
>> >> >> side-effect cleans up the much "loved" debug info IR assembly syntax.
>> >> >>
>> >> >> At a high level, the idea is to create distinct subclasses of `Value`
>> >> >> for each debug info concept, starting with line table entries and
>> >> >> moving on to the DIDescriptor hierarchy.  By leveraging the use-list
>> >> >> infrastructure for metadata operands -- i.e., only using value
>> >> >> handles for non-metadata operands -- we'll improve memory usage and
>> >> >> increase RAUW speed.
>> >> >>
>> >> >> My rough plan follows.  I quote some numbers for memory savings below
>> >> >> based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`
>> >> >> on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's
>> >> >> -save-temps option) that currently peaks at 15.3GB.
>> >> >
>> >> >
>> >> > Stupid question, but when I was working on LTO last summer, the
>> >> > primary culprit for excessive memory use was us not being smart when
>> >> > linking the IR together (Espindola would know more details). Do we
>> >> > still have that problem? For starters, how does the memory usage of
>> >> > just llvm-link compare to the memory usage of the actual LTO run? If
>> >> > the issue I was seeing last summer is still there, you should see that
>> >> > the invocation of llvm-link is actually the most memory-intensive part
>> >> > of the LTO step, by far.
>> >> >
>> >>
>> >> This is vague. Could you be more specific on where you saw all of the
>> >> memory go?
>> >
>> >
>> > Running `llvm-link *.bc` would OOM a machine with 64GB of RAM (with -g;
>> > without -g, it completed with much less). The increase could easily be
>> > watched on the system "process monitor" in real time.
>> >
>>
>> This is likely what we've already discussed, and it was handled a long
>> while ago now.
>>
>
> I was reading the thread in sequential order (and replying without
> finishing). derp.

No worries, and hey, you might have had something else which we'd
definitely want to hear about :)

Heck, for that matter, we know there are other things, so numbers are awesome.

-eric

>
> -- Sean Silva
>
>>
>> -eric
>>
>> > -- Sean Silva
>> >
>> >>
>> >>
>> >> -eric
>> >>
>> >> >
>> >> > Also, you seem to really like saying "peak" here. Is there a definite
>> >> > peak? When does it occur?
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >>  1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s
>> >> >>     must all be metadata.  The cost per operand is 1 pointer, vs. 4
>> >> >>     pointers in an `MDNode`.
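>> >> >>
>> >> >>     A rough sketch of the base class (illustrative only; the exact
>> >> >>     names and constructor are provisional, not a final API):
>> >> >>
>> >> >>         // Each metadata operand is a plain `Use` (1 pointer of
>> >> >>         // payload), vs. roughly 4 pointers today for an `MDNode`
>> >> >>         // operand (a CallbackVH: vtable, value pointer, and
>> >> >>         // prev/next links in the value-handle list).
>> >> >>         class MDUser : public User {
>> >> >>         protected:
>> >> >>           MDUser(LLVMContext &Context, unsigned NumOperands);
>> >> >>
>> >> >>         public:
>> >> >>           // All operands are metadata.
>> >> >>           MDNode *getOperand(unsigned I) const {
>> >> >>             return cast<MDNode>(User::getOperand(I));
>> >> >>           }
>> >> >>         };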
>> >> >>
>> >> >>  2. Create `MDLineTable` as the first subclass of `MDUser`.  Use
>> >> >>     normal fields (not `Value`s) for the line and column, and use
>> >> >>     `Use` operands for the metadata operands.
>> >> >>
>> >> >>     On x86-64, this will save 104B / line table entry.  Linking
>> >> >>     `llvm-lto` uses ~7M line-table entries, so this on its own saves
>> >> >>     ~700MB.
>> >> >>
>> >> >>
>> >> >>     Sketch of class definition:
>> >> >>
>> >> >>         class MDLineTable : public MDUser {
>> >> >>           unsigned Line;
>> >> >>           unsigned Column;
>> >> >>         public:
>> >> >>           static MDLineTable *get(unsigned Line, unsigned Column,
>> >> >>                                   MDNode *Scope);
>> >> >>           static MDLineTable *getInlined(MDLineTable *Base,
>> >> >>                                          MDNode *Scope);
>> >> >>           static MDLineTable *getBase(MDLineTable *Inlined);
>> >> >>
>> >> >>           unsigned getLine() const { return Line; }
>> >> >>           unsigned getColumn() const { return Column; }
>> >> >>           bool isInlined() const { return getNumOperands() == 2; }
>> >> >>           MDNode *getScope() const { return getOperand(0); }
>> >> >>           MDNode *getInlinedAt() const { return getOperand(1); }
>> >> >>         };
>> >> >>
>> >> >>     Proposed assembly syntax:
>> >> >>
>> >> >>         ; Not inlined.
>> >> >>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)
>> >> >>
>> >> >>         ; Inlined.
>> >> >>         !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,
>> >> >>                                    inlinedAt: metadata !10)
>> >> >>
>> >> >>         ; Column defaulted to 0.
>> >> >>         !7 = metadata !MDLineTable(line: 45, scope: metadata !9)
>> >> >>
>> >> >>     (What colour should that bike shed be?)
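>> >> >>
>> >> >>     For illustration, a hypothetical use of the API sketched above,
>> >> >>     assuming `Scope` and `InlinedScope` are existing `MDNode`s:
>> >> >>
>> >> >>         // 104B saved per entry * ~7M entries ~= 730MB, which is
>> >> >>         // where the ~700MB figure above comes from.
>> >> >>         MDLineTable *Loc = MDLineTable::get(45, 7, Scope);
>> >> >>         MDLineTable *Inl = MDLineTable::getInlined(Loc, InlinedScope);
>> >> >>         assert(!Loc->isInlined() && Inl->isInlined());
>> >> >>         assert(MDLineTable::getBase(Inl) == Loc);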
>> >> >>
>> >> >>  3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows
>> >> >>     that we have 3.5M entries in the `DebugLoc` side-vectors for 7M
>> >> >>     line table entries.  The cost of these is ~180B each, for another
>> >> >>     ~600MB.
>> >> >>
>> >> >>     If we integrate a side-table of `MDLineTable`s into its uniquing,
>> >> >>     the overhead is only ~12B / line table entry, or ~80MB.  This
>> >> >>     saves ~520MB.
>> >> >>
>> >> >>     This is somewhat orthogonal to redesigning the metadata format,
>> >> >>     but IMO it's worth doing as soon as possible.
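>> >> >>
>> >> >>     (Back of the envelope, assuming the counts above are accurate:
>> >> >>
>> >> >>         3.5M entries * ~180B ~= 630MB  (quoted as ~600MB)
>> >> >>         7M entries   * ~12B  ~=  84MB  (quoted as ~80MB)
>> >> >>
>> >> >>     so the ~520MB savings is just the difference of the two.)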
>> >> >>
>> >> >>  4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`
>> >> >>     through an intermediate class `DebugMDNode` with an
>> >> >>     allocation-time-optional `CallbackVH` available for referencing
>> >> >>     non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode`
>> >> >>     instead of an `MDNode`.
>> >> >>
>> >> >>     This saves another ~960MB, for a running total of ~2GB.
>> >> >
>> >> >
>> >> > 2GB (out of 15.3GB, i.e. ~13%) seems like pretty pathetic savings
>> >> > when we have a single pie slice near 40% of the # of Values allocated
>> >> > and another at 21%. Especially considering this is "step 4".
>> >> >
>> >> > As a rough back-of-the-envelope calculation, dividing 15.3GB by ~24
>> >> > million Values gives about 600 bytes per Value. That seems sort of
>> >> > excessive (but is it realistic?). All of the data types that you are
>> >> > proposing to shrink fall far short of this "average size", meaning
>> >> > that if you are trying to reduce memory usage, you might be looking
>> >> > in the wrong place. Something smells fishy. At the very least, this
>> >> > would indicate that the real memory usage is elsewhere.
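>> >> >
>> >> > (Concretely: 15.3e9 bytes / 24e6 Values ~= 640 bytes per Value, under
>> >> > the assumption that peak memory is entirely attributable to Value
>> >> > allocations -- which is exactly what looks doubtful here.)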
>> >> >
>> >> > A pie chart breaking down the total memory usage seems essential to
>> >> > have here.
>> >> >
>> >> >>
>> >> >>
>> >> >>     Proposed assembly syntax:
>> >> >>
>> >> >>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,
>> >> >>                                           fields: "0\00clang 3.6\00...",
>> >> >>                                           operands: { metadata !8, ... })
>> >> >>
>> >> >>         !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,
>> >> >>                                           fields: "global_var\00...",
>> >> >>                                           operands: { metadata !8, ... },
>> >> >>                                           handle: i32* @global_var)
>> >> >>
>> >> >>     This syntax pulls the tag out of the current header-string, calls
>> >> >>     the rest of the header "fields", and includes the metadata
>> >> >>     operands in "operands".
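>> >> >>
>> >> >>     A rough sketch of the transitional classes (illustrative only;
>> >> >>     the exact names and fields are provisional):
>> >> >>
>> >> >>         class DebugMDNode : public MDUser {
>> >> >>           unsigned Tag; // DW_TAG_*.
>> >> >>           // Allocated only for nodes that reference non-metadata,
>> >> >>           // e.g. the global variable above.
>> >> >>           CallbackVH *NonMDHandle;
>> >> >>
>> >> >>         public:
>> >> >>           unsigned getTag() const { return Tag; }
>> >> >>           Value *getHandle() const {
>> >> >>             return NonMDHandle ? static_cast<Value *>(*NonMDHandle)
>> >> >>                                : nullptr;
>> >> >>           }
>> >> >>         };
>> >> >>
>> >> >>         class GenericDebugMDNode : public DebugMDNode {
>> >> >>           std::string Fields; // "\0"-separated header fields.
>> >> >>
>> >> >>         public:
>> >> >>           StringRef getFields() const { return Fields; }
>> >> >>         };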
>> >> >>
>> >> >>  5. Incrementally create subclasses of `DebugMDNode`, such as
>> >> >>     `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace
>> >> >>     the "fields" and "operands" catch-alls with explicit names for
>> >> >>     each operand.
>> >> >>
>> >> >>     Proposed assembly syntax:
>> >> >>
>> >> >>         !7 = metadata !MDSubprogram(line: 45, name: "foo",
>> >> >>                                     displayName: "foo",
>> >> >>                                     linkageName: "_Z3foov",
>> >> >>                                     file: metadata !8,
>> >> >>                                     function: i32 (i32)* @foo)
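>> >> >>
>> >> >>     A subclass might then look roughly like this (hypothetical
>> >> >>     accessors mirroring the syntax above):
>> >> >>
>> >> >>         class MDSubprogram : public DebugMDNode {
>> >> >>           unsigned Line;
>> >> >>           std::string Name, DisplayName, LinkageName;
>> >> >>
>> >> >>         public:
>> >> >>           unsigned getLine() const { return Line; }
>> >> >>           StringRef getName() const { return Name; }
>> >> >>           StringRef getLinkageName() const { return LinkageName; }
>> >> >>           MDNode *getFile() const { return getOperand(0); }
>> >> >>           // The `function:` reference is non-metadata, so it lives
>> >> >>           // in the optional CallbackVH rather than in a `Use`.
>> >> >>           Value *getFunction() const { return getHandle(); }
>> >> >>         };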
>> >> >>
>> >> >>  6. Remove the dead code for `GenericDebugMDNode`.
>> >> >>
>> >> >>  7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW
>> >> >>     traffic during bitcode serialization.  Now that metadata types
>> >> >>     are known, we can write debug info out in an order that makes it
>> >> >>     cheap to read back in.
>> >> >>
>> >> >>     Note that using `MDUser` will make RAUW much cheaper, since we're
>> >> >>     using the use-list infrastructure for most of them.  If RAUW
>> >> >>     isn't showing up in a profile, I may skip this.
>> >> >>
>> >> >> Does this direction seem reasonable?  Any major problems I've missed?
>> >> >
>> >> >
>> >> > You need more data. Right now you have essentially one data point,
>> >> > and it's not even clear what you really measured. If your goal is
>> >> > saving memory, I would expect at least a pie chart that breaks down
>> >> > LLVM's memory usage (not just the # of allocations of different
>> >> > sorts; an approximation is fine, as long as you explain how you
>> >> > arrived at it and in what sense it approximates the true number).
>> >> >
>> >> > Do the numbers change significantly for different projects? (e.g.
>> >> > Chromium, Firefox, a kernel, or a large app you have handy to compile
>> >> > with LTO?). If you have specific data you want (and a suggestion for
>> >> > how to gather it), I can also get you numbers for one of our internal
>> >> > games as well.
>> >> >
>> >> > Once you have some more data, as a first step I would like to see an
>> >> > analysis of how much we can "ideally" expect to gain (back of the
>> >> > envelope calculations == win).
>> >> >
>> >> > -- Sean Silva
>> >> >