<div dir="ltr">For those interested, I've attached some pie charts based on Duncan's data in one of the other posts; successive slides break down the usage increasingly finely. As I understand it, they show the number of `Value` objects (including subclasses) allocated.<br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 13, 2014 at 3:02 PM, Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">In r219010, I merged integer and string fields into a single header<br>

field.  By reducing the number of metadata operands used in debug info,<br>

this saved 2.2GB on an `llvm-lto` bootstrap.  I've done some profiling<br>

of DW_TAGs to see what parts of PR17891 and PR17892 to tackle next, and<br>

I've concluded that they will be insufficient.<br>

<br>

Instead, I'd like to implement a more aggressive plan, which as a<br>

side-effect cleans up the much "loved" debug info IR assembly syntax.<br>

<br>

At a high level, the idea is to create distinct subclasses of `Value`<br>

for each debug info concept, starting with line table entries and moving<br>

on to the DIDescriptor hierarchy.  By leveraging the use-list<br>

infrastructure for metadata operands -- i.e., only using value handles<br>

for non-metadata operands -- we'll improve memory usage and increase<br>

RAUW speed.<br>

<br>

My rough plan follows.  I quote some numbers for memory savings below<br>

based on an -flto -g bootstrap of `llvm-lto` (i.e., running `llvm-lto`<br>

on `llvm-lto.lto.bc`, an already-linked bitcode file dumped by ld64's<br>

-save-temps option) that currently peaks at 15.3GB.<br></blockquote><div><br></div><div>Stupid question, but when I was working on LTO last summer, the primary culprit for excessive memory use was that we weren't being smart when linking the IR together (Espindola would know more details). Do we still have that problem? For starters, how does the memory usage of just llvm-link compare to the memory usage of the actual LTO run? If the issue I was seeing last summer is still there, you should see that the invocation of llvm-link is by far the most memory-intensive part of the LTO step.</div><div><br></div><div><br></div><div><div>Also, you seem to really like saying "peak" here. Is there a definite peak? When does it occur?</div><div> </div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

 1. Introduce `MDUser`, which inherits from `User`, and whose `Use`s<br>

    must all be metadata.  The cost per operand is 1 pointer, vs. 4<br>

    pointers in an `MDNode`.<br>

<br>

 2. Create `MDLineTable` as the first subclass of `MDUser`.  Use normal<br>

    fields (not `Value`s) for the line and column, and use `Use`<br>

    operands for the metadata operands.<br>

<br>

    On x86-64, this will save 104B / line table entry.  Linking<br>

    `llvm-lto` uses ~7M line-table entries, so this on its own saves<br>

    ~700MB.</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

    Sketch of class definition:<br>

<br>

        class MDLineTable : public MDUser {<br>

          unsigned Line;<br>

          unsigned Column;<br>

        public:<br>

          static MDLineTable *get(unsigned Line, unsigned Column,<br>

                                  MDNode *Scope);<br>

          static MDLineTable *getInlined(MDLineTable *Base, MDNode *Scope);<br>

          static MDLineTable *getBase(MDLineTable *Inlined);<br>

<br>

          unsigned getLine() const { return Line; }<br>

          unsigned getColumn() const { return Column; }<br>

          bool isInlined() const { return getNumOperands() == 2; }<br>

          MDNode *getScope() const { return getOperand(0); }<br>

          MDNode *getInlinedAt() const { return getOperand(1); }<br>

        };<br>

<br>

    Proposed assembly syntax:<br>

<br>

        ; Not inlined.<br>

        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9)<br>

<br>

        ; Inlined.<br>

        !7 = metadata !MDLineTable(line: 45, column: 7, scope: metadata !9,<br>

                                   inlinedAt: metadata !10)<br>

<br>

        ; Column defaulted to 0.<br>

        !7 = metadata !MDLineTable(line: 45, scope: metadata !9)<br>

<br>

    (What colour should that bike shed be?)<br>

<br>

 3. (Optional) Rewrite `DebugLoc` lookup tables.  My profiling shows<br>

    that we have 3.5M entries in the `DebugLoc` side-vectors for 7M line<br>

    table entries.  The cost of these is ~180B each, for another<br>

    ~600MB.<br>

<br>

    If we integrate a side-table of `MDLineTable`s into its uniquing,<br>

    the overhead is only ~12B / line table entry, or ~80MB.  This saves<br>

    520MB.<br>

<br>

    This is somewhat perpendicular to redesigning the metadata format,<br>

    but IMO it's worth doing as soon as it's possible.<br>

<br>

 4. Create `GenericDebugMDNode`, a transitional subclass of `MDUser`<br>

    through an intermediate class `DebugMDNode` with an<br>

    allocation-time-optional `CallbackVH` available for referencing<br>

    non-metadata.  Change `DIDescriptor` to wrap a `DebugMDNode` instead<br>

    of an `MDNode`.<br>

<br>

    This saves another ~960MB, for a running total of ~2GB.<br></blockquote><div><br></div><div>2GB (out of 15.3GB, i.e. ~13%) seems like a pretty pathetic saving when we have a single pie slice near 40% of the number of `Value`s allocated and another at 21%, especially since this is already "step 4".</div><div><br></div><div>As a rough back-of-the-envelope calculation, dividing 15.3GB by ~24 million `Value`s gives about 600 bytes per `Value`. That seems sort of excessive (but is it realistic?). All of the data types that you are proposing to shrink fall far short of this "average size", meaning that if you are trying to reduce memory usage, you might be looking in the wrong place. Something smells fishy. At the very least, this would indicate that the real memory usage is elsewhere.</div><div><br></div><div>A pie chart breaking down the total memory usage seems essential to have here.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

    Proposed assembly syntax:<br>

<br>

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_compile_unit,<br>

                                          fields: "0\00clang 3.6\00...",<br>

                                          operands: { metadata !8, ... })<br>

<br>

        !7 = metadata !GenericDebugMDNode(tag: DW_TAG_variable,<br>

                                          fields: "global_var\00...",<br>

                                          operands: { metadata !8, ... },<br>

                                          handle: i32* @global_var)<br>

<br>

    This syntax pulls the tag out of the current header-string, calls<br>

    the rest of the header "fields", and includes the metadata operands<br>

    in "operands".<br>

<br>

 5. Incrementally create subclasses of `DebugMDNode`, such as<br>

    `MDCompileUnit` and `MDSubprogram`.  Sub-classed nodes replace the<br>

    "fields" and "operands" catch-alls with explicit names for each<br>

    operand.<br>

<br>

    Proposed assembly syntax:<br>

<br>

        !7 = metadata !MDSubprogram(line: 45, name: "foo", displayName: "foo",<br>

                                    linkageName: "_Z3foov", file: metadata !8,<br>

                                    function: i32 (i32)* @foo)<br>

<br>

 6. Remove the dead code for `GenericDebugMDNode`.<br>

<br>

 7. (Optional) Refactor `DebugMDNode` sub-classes to minimize RAUW<br>

    traffic during bitcode serialization.  Now that metadata types are<br>

    known, we can write debug info out in an order that makes it cheap<br>

    to read back in.<br>

<br>

    Note that using `MDUser` will make RAUW much cheaper, since we're<br>

    using the use-list infrastructure for most of them.  If RAUW isn't<br>

    showing up in a profile, I may skip this.<br>

<br>

Does this direction seem reasonable?  Any major problems I've missed?<br></blockquote><div><br></div><div>You need more data. Right now you have essentially one data point, and it's not really clear what you measured. If your goal is saving memory, I would expect at least a pie chart that breaks down LLVM's memory usage (not just the number of allocations of different sorts; an approximation is fine, as long as you explain how you arrived at it and in what sense it approximates the true number).</div><div><br></div><div>Do the numbers change significantly for different projects (e.g. Chromium or Firefox or a kernel or a large app you have handy to compile with LTO)? If you have specific data you want (and a suggestion for how to gather it), I can also get you numbers for one of our internal games.</div><div><br></div><div>Once you have some more data, then as a first step, I would like to see an analysis of how much we can "ideally" expect to gain (back-of-the-envelope calculations == win).</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>


</blockquote></div><br></div></div>
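<div>For anyone who wants to re-run the back-of-the-envelope numbers in this thread, here is a minimal sketch. Every input is a figure quoted above (nothing is newly measured), the function and constant names are made up for illustration, and it assumes decimal units (1GB = 10^9 bytes), which the thread does not state.</div>

```cpp
#include <cstdint>

// Inputs quoted in the thread. Assumption: decimal units (1GB = 1e9 bytes).
constexpr std::uint64_t kLineTableEntries = 7000000ULL;     // ~7M line-table entries
constexpr std::uint64_t kSavedPerLineEntry = 104ULL;        // bytes saved per entry (step 2)
constexpr std::uint64_t kDebugLocEntries = 3500000ULL;      // DebugLoc side-vector entries
constexpr std::uint64_t kDebugLocCost = 180ULL;             // ~bytes per side-vector entry today
constexpr std::uint64_t kSideTableCost = 12ULL;             // ~bytes per line-table entry (step 3)
constexpr std::uint64_t kPeakBytes = 15300000000ULL;        // 15.3GB peak
constexpr std::uint64_t kValuesAllocated = 24000000ULL;     // ~24M Values from the pie charts

// Step 2: savings from turning line-table entries into MDLineTable.
constexpr std::uint64_t step2SavingsBytes() {
  return kLineTableEntries * kSavedPerLineEntry;
}

// Step 3: current cost of the DebugLoc side-vectors.
constexpr std::uint64_t step3CurrentBytes() {
  return kDebugLocEntries * kDebugLocCost;
}

// Step 3: cost after folding the side-table into MDLineTable uniquing.
constexpr std::uint64_t step3ProposedBytes() {
  return kLineTableEntries * kSideTableCost;
}

// Sanity check from the reply: average bytes of peak memory per Value.
constexpr std::uint64_t bytesPerValue() {
  return kPeakBytes / kValuesAllocated;
}

// Fraction of the peak that a given saving represents.
constexpr double savedFraction(std::uint64_t SavedBytes) {
  return static_cast<double>(SavedBytes) / static_cast<double>(kPeakBytes);
}
```

<div>The exact products are 728MB for step 2, 630MB and 84MB for step 3, and 637 bytes per Value, which match the rounded figures quoted above (~700MB, ~600MB, ~80MB, "about 600 bytes"). The quoted 520MB saving for step 3 is the difference of the rounded figures; the unrounded difference is 546MB. `savedFraction(2000000000ULL)` gives about 0.13, i.e. the ~13% mentioned in the reply.</div>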