[LLVMdev] Proposal: type uniquing of debug info for LTO

David Blaikie dblaikie at gmail.com
Thu Jun 20 16:10:20 PDT 2013


On Thu, Jun 20, 2013 at 4:06 PM, Manman Ren <mren at apple.com> wrote:
>
> On Jun 20, 2013, at 3:27 PM, David Blaikie wrote:
>
>> +1 to roughly everything Eric said.
>>
>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>>>
>>> The intent of this proposal is to speedup compilation of "-flto -g" for c++ programs.
>>> This is based on discussions with Adrian, David and Eric.
>>>
>>> ---------------------------
>>> Problem:
>>> A single class can be used in multiple source files and the DI (Debug Info) class is included in multiple bc files. The duplication of
>>> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to large memory requirement.
>>>
>>> As an example, SPEC xalancbmk requires 7GB of memory when compiled with -flto -g.
>>> With a preliminary implementation of type uniquing, the memory usage will be down to 2GB.
>>>
>>> In order to unique types, we have to break cycles in the MDNodes.
>>>
>>> A simple struct definition
>>> struct Base {
>>> int a;
>>> };
>>> can cause cycles in MDNodes:
>>> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32 1, i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
>>> !14 = metadata !{metadata !15, metadata !16}
>>> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata !"a", i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line 2, size 32, align 32, offset 0] [from int]
>>> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata !"Base", metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1 false, i32 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32 1} ; [ DW_TAG_subprogram ] [line 1] [Base]
>>>
>>> Cycles: !12 -- !14 -- !15 -- !12
>>>            !12 -- !14 -- !16 -- !12
>>>
>>> These cycles make it hard to unique the same struct used in two bc files.
>>>
>>> ---------------------------
>>> How to fix:
>>>
>>> We attach a hash value to types to help type uniquing and we also replace references to types with their hash values.
>>> For the above struct "Base", we now have the following MDNodes:
>>> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1, i64 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
>>> !6 = metadata !{metadata !7, metadata !9}
>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a", i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line 2, size 32, align 32, offset 0] [from int]
>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base", metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ; [ DW_TAG_subprogram ] [line 1] [Base]
>>>
>>> Note that the cycles are gone and !4 has a hash value of 915398439, and the references to !4 are replaced with 915398439.
>>> Thanks Eric for suggesting replacing MD reference with a hash value.
>>>
>>> There are a few issues:
>>> 1> How to generate the hash for a given type?
>>> With C++'s ODR, it should be enough by using the context and the name for non-internal c++ types.
>>> For internal c++ types and types of other languages, hash will not be used.
>>>
>>> 2> Make sure the streamers (bit code or ll writer) will output all required MDNodes.
>>> For MDNodes that are only used through hash values, we have to make sure they are dumped.
>>
>> should read "to make sure they are not dumped"
> I meant to say make sure the MDNodes are written out to bc or ll file.
>
>
>> - just to clarify for
>> other readers: unnamed metadata is kept alive by being reference
>> (directly or indirectly) by named metadata. If we break the reference
>> by using a string identifier pair, the type metadata wouldn't exist as
>> it would be unreferenced in the IR sense.
>
> Thanks for the clarification.
>>
>>> The solution currently is to add these MDNodes to retained types of the compile unit.
>>
>> So we can just keep real metadata (rather than string/hash/thingy)
>> references to these nodes in a flat list to ensure they don't get
>> dropped.
>>
>>> Thanks David for the suggestion.
>>>
>>> 3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
>>> The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
>>> Each hash value should have at most one corresponding definition and at most one declaration.
>>>
>>> We need the map when building DI for verification purposes, when writing the ll file to dump the comments, at llvm linking time to update the map,
>>> at backend when generating the actual dwarf.
>>>
>>> My current implementation is to add a few static member functions in MDNode to profile DI nodes differently.
>>> + /// If the array of Vals is for debug info, profile it specially and return true.
>>> + /// If the DI node has a hash value, generate the profile using only the hash value and the declaration flag.
>>> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals, FoldingSetNodeID &ID);
>>>
>>> + /// If the MDNode is for debug info, profile it specially and return true.
>>> + /// If the DI node has a hash value, generate the profile using only the hash value and the declaration flag.
>>> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID &ID);
>>>
>>> + /// Given a hash value and a flag, generate the profile for later lookup.
>>> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration, FoldingSetNodeID &ID);
>>>
>>> These static functions are called in Metadata.cpp:
>>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>>> +  if (profileDebugInfoNode(this, ID))
>>> +    return;
>>> +
>>>
>>> There are other examples of these in MDNode for handling of specific metadata.
>>> /// Methods for metadata merging.
>>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>>>
>>> Comments are welcome on whether this violates any layering rule.
>>
>> As we've already discussed off list, both Eric & I believe this isn't
>> an appropriate layering & that LLVM's core IR metadata APIs cannot
>> reasonably have special hooks for this debug info feature.
>
> Understand, that is why it is an option and comments from other people are welcome.
>
>>
>>> Other choices are:
>>> a> Keep a map in DwarfDebug
>>>    Keep in mind that the map is used at many stages, and it has to be in sync with MDNodeSet.
>>
>> I'm not sure what you mean by "in sync with the MDNodeSet".
> When a node is created or RAUW, the map has to be updated.
> MDNodeSet keeps all live MDNodes and the map keeps part of MDNodeSet.

Presumably the map would work with ValueHandles? I don't know IR well
enough & just speculating, really, but I don't see any reason that the
map should have any relation to the MDNodeSet specifically.

>
>>
>>> b> Generalize MDNode to be aware of hash (David can provide more details)
>>
>> My suggestion here, which I think is the only reason to really discuss
>> this broadly (because it would be an LLVM metadata API feature) is
>> roughly as follows:
>>
>> An LLVM IR feature, completely independent from debug info that would
>> work something like:
>> a) a blessed named MD node root that would contain string+node pairs
>> (alternating elements probably, for space conservation)
>> b) a means to add an operand to an MD node with an "indirect"
>> reference - this would cause the string to be used as the value (with
>> some flag that it's a 'special' string) but any access of the node
>> through the operand list would return the underlying MDNode from the
>> list (lookup via a map in LLVMContext, essentially).
>>
>> I'm not at all deeply familiar with the LLVM IR metadata APIs - so
>> this is a strawman at best but an idea about how we could provide the
>> functionality that debug info could benefit from in an isolated, not
>> debug-specific, feature. If we're going to touch metadata APIs I think
>> something like this should be how we do it.
>>
>> If anyone more familiar with these APIs & the direction we should and
>> shouldn't take them could clarify if this is
>> possible/reasonable/acceptable/the right tradeoff compared to (a) that
>> would be much appreciated. That's what I'd like to get out of this
>> thread.
>>
>> (for some more context: (a) will be isolated to debug info code, but
>> will require (in my naive estimation) a fair amount of mechanical work
>> to perform manual lookups on types whenever we access them - at some
>> point we may have a patch that demonstrates just how much work that
>> is, but until then we can only make vague statements, and by the time
>> we do this by any substantial amount it will've been a fair bit of
>> work only to throw it away in favor of (b) if it looks too unweildy.)
>>
>>> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer, ll reader|writer) to be aware of DINode
>>>    We can provide DINode::get(…) to create a DINode. DINode can have its own Profile function.
>>
>> As discussed offline, Eric & I really don't think this is acceptable.
>> Again, those who own/maintain/have opinions on LLVM metadata APIs can
>> weigh in if you think we're off-base in that assessment.
>
> It is an option to go in the other direction, either generalize MDNode or have a special class for DINode.

The concern is that having a special case such as what you've
described for DINode (the naming isn't important, the concept is) is
that it's not a generally usable/complete feature to add to the
metadata APIs. If we can generalize a useful standalone feature to add
to metadata support (I believe what I've described in (b) is such a
thing) then that's one thing - but to have half a feature in there is
a bit uncomfortable.

- David

>
> - Manman
>>
>>> Other suggestions are welcome.
>>>
>>> ---------------------------
>>> Preliminary Results:
>>> SPEC xalancbmk: ld time down from 20+ mins to 4+ mins on my Mac Air
>>>
>>> -----------------------------
>>> Transition from current DI Metadata:
>>> To have a smooth transition, I will add a flag "-gtype-hashing" for the type uniquing work and turn it on by default when we are ready.
>>>
>>> -----------------------------
>>> Patches:
>>> Expect the following patches:
>>> 1> add flag -gtype-hashing
>>> 2> add hash field to DI types
>>
>> If we can avoid changing the schema of DI types it's really best that
>> way - especially adding (or removing) fields, moreso while that's
>> under a flag. It'll be a real pain to update test cases & diverge the
>> schema making it harder to maintain good test coverage.
>>
>> Essentially what we want to do is a single pass (as Eric said) - where
>> before there were MDNodes, there would now be strings. That means
>> updating DIBuilder and DwarfDebug, etc is one go, basically.
>>
>>> 3> modify DIBuilder to use hash instead of MD reference
>>> 4> related to issue 3
>>> 5> backend change (in DwarfDebug|CompileUnit) to support types shared among compile units
>>>    requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to ref_addr
>>> All changes should be local to debug info classes except patch #4.
>




More information about the llvm-dev mailing list