[LLVMdev] Proposal: type uniquing of debug info for LTO

Fri Jun 21 10:39:59 PDT 2013

On Jun 21, 2013, at 10:18 AM, David Blaikie wrote:

> On Fri, Jun 21, 2013 at 10:13 AM, Manman Ren <mren at apple.com> wrote:
>> 
>> More details please :]
>> What do you mean by "name the metadata"? Are you referring to the name field
>> of the MDNode?
> 
> Using named metadata rather than unnamed metadata.
> 
> rather than having:
> 
> !llvm.hardref = !{metadata !"foo.h::myClass", !3, metadata
> !"bar.h:myEnum", !4} ...
> !3 = ...;
> !4 = ...;
> 
> we could simply have:
> 
> !llvm.hardref.foo.h.myClass = ...;
> !llvm.hardref.bar.h.myClass = ...;
> 
> or something like that.

From the documentation:
NamedMDNode - a tuple of MDNodes. Despite its name, a NamedMDNode isn't itself an MDNode. NamedMDNodes belong to modules, have names, and contain lists of MDNodes.

Our DI classes all wrap around a MDNode. Also during IR linking, NamedMDNode is handled differently.
Did I miss something?

Thanks,
Manman
> 
> 
> (assuming solution (b), in solution (a) it'd look more like
> "llvm.dbg.type.foo.h.myClass" - but the same idea (then we could skip
> using the retained types list, avoid adding a new 'name' member to
> DITypes or changing the schema of the retained types list to be
> string/mdnode pairs))
> 
>> 
>> Thanks,
>> Manman
>> 
>> On Jun 21, 2013, at 7:19 AM, David Blaikie wrote:
>> 
>> 
>> On Jun 21, 2013 1:19 AM, "Jeremy Lakeman" <Jeremy.Lakeman at gmail.com> wrote:
>>> 
>>> Wouldn't it be simpler to name the metadata based on the hash of the
>>> content? Then you could use a normal reference to that metadata without
>>> needing to create a new type or teach the rest of llvm how to use it...
>> 
>> Sounds reasonable (probably with some nice prefix to get these things a
>> namespace). Though this applies to both (a) and (b) equally.
>> 
>>> 
>>> 
>>> On Fri, Jun 21, 2013 at 3:22 PM, Manman Ren <mren at apple.com> wrote:
>>>> 
>>>> 
>>>> A summary of options for issue #3:
>>>> 3> To actually access the MDNode referenced via the hash value, we need
>>>> to perform a lookup from the hash value to find the corresponding MDNode.
>>>> The questions are where to store this map and how to keep it up-to-date
>>>> when a MDNode is replaced.
>>>> ---------------------
>>>> Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
>>>> functions in DI classes to pass in the map.
>>>> May need to modify printing functions in AsmWriter.
>>>> ---------------------
>>>> Option b) I am going to expand David's description with more details,
>>>> correct me if I am wrong.
>>>>   A special Value called MDHash in parallel with MDString, MDHash needs
>>>> to specify whether it is a use of the hash value, or a definition of the
>>>> hash value
>>>>   The updated class hierarchy:
>>>>   Value -- MDNode
>>>>               -- MDHash
>>>>               -- MDString
>>>>               -- User -- Instruction -- BinaryOperator
>>>> 
>>>>   In the following example:
>>>> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1,
>>>> i64 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>>>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
>>>> offset 0] [from ]
>>>> !6 = metadata !{metadata !7, metadata !9}
>>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a",
>>>> i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>>>> [line 2, size 32, align 32, offset 0] [from int]
>>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base",
>>>> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32
>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ;
>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>>   915398439 in !4 is the definition of a hash value, 915398439 in !7 or
>>>> !9 is the use of a hash value.
>>>> 
>>>> A separate map "HashToNodeMap" is added to LLVMContext.
>>>> MDNode will handle MDHash correctly in the sense that it will update
>>>> HashToNodeMap whenever it creates a MDNode with a MDHash operand.
>>>> MDNode::getOperand will perform a lookup if the actual operand is MDHash
>>>> and return the corresponding MDNode.
>>>> 
>>>> I am not sure whether a named metadata to describe the map is necessary,
>>>> assuming we already have the retained type list for each CU.
>>>> Also at IR linking, the named metadata from all CUs will be combined,
>>>> leading to a long list.
>>>> 
>>>> ---------------------
>>>> David's description is pasted here for convenience:
>>>> a blessed named MD node root that would contain string+node pairs
>>>> (alternating elements probably, for space conservation)
>>>> 
>>>> a means to add an operand to an MD node with an "indirect"
>>>> reference - this would cause the string to be used as the value (with
>>>> some flag that it's a 'special' string) but any access of the node
>>>> through the operand list would return the underlying MDNode from the
>>>> list (lookup via a map in LLVMContext, essentially).
>>>> 
>>>> ---------------------
>>>> Options that are vetoed:
>>>> 1> Specialize Profile function for DI nodes to use MDNodeSet to support
>>>> lookup. No need for a separate map.
>>>> 2> Extend the class hierarchy to have a class for DI nodes:
>>>> Value -- MDNode -- DINode
>>>>             -- MDString
>>>>             -- User -- Instruction -- BinaryOperator
>>>> 
>>>> I personally prefer option b over a since it is much cleaner.
>>>> 
>>>> Thanks,
>>>> Manman
>>>> 
>>>> On Jun 20, 2013, at 5:39 PM, David Blaikie wrote:
>>>> 
>>>>> On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> 
>>>>>> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>>>>>> 
>>>>>> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>>>>>> 
>>>>>> Hi Manman,
>>>>>> 
>>>>>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> The intent of this proposal is to speedup compilation of "-flto -g"
>>>>>> for c++
>>>>>> programs.
>>>>>> This is based on discussions with Adrian, David and Eric.
>>>>>> 
>>>>>> 
>>>>>> Thanks for bringing this back to the list. The original thread was
>>>>>> getting quite long.
>>>>>> 
>>>>>> ---------------------------
>>>>>> Problem:
>>>>>> A single class can be used in multiple source files and the DI (Debug
>>>>>> Info)
>>>>>> class is included in multiple bc files. The duplication of
>>>>>> class definitions causes blow-up in # of MDNodes, # of DIEs, leading
>>>>>> to
>>>>>> large memory requirement.
>>>>>> 
>>>>>> As an example, SPEC xalancbmk requires 7GB of memory when compiled
>>>>>> with
>>>>>> -flto -g.
>>>>>> With a preliminary implementation of type uniquing, the memory usage
>>>>>> will be
>>>>>> down to 2GB.
>>>>>> 
>>>>>> In order to unique types, we have to break cycles in the MDNodes.
>>>>>> 
>>>>>> A simple struct definition
>>>>>> struct Base {
>>>>>> int a;
>>>>>> };
>>>>>> can cause cycles in MDNodes:
>>>>>> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32
>>>>>> 1,
>>>>>> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ;
>>>>>> [
>>>>>> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0]
>>>>>> [from ]
>>>>>> !14 = metadata !{metadata !15, metadata !16}
>>>>>> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata
>>>>>> !"a", i32
>>>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>>>>>> [line
>>>>>> 2, size 32, align 32, offset 0] [from int]
>>>>>> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata
>>>>>> !"Base",
>>>>>> metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1
>>>>>> false, i32
>>>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20,
>>>>>> i32 1} ;
>>>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>>>> 
>>>>>> Cycles: !12 -- !14 -- !15 -- !12
>>>>>>       !12 -- !14 -- !16 -- !12
>>>>>> 
>>>>>> These cycles make it hard to unique the same struct used in two bc
>>>>>> files.
>>>>>> 
>>>>>> ---------------------------
>>>>>> How to fix:
>>>>>> 
>>>>>> We attach a hash value to types to help type uniquing and we also
>>>>>> replace
>>>>>> references to types with their hash values.
>>>>>> For the above struct "Base", we now have the following MDNodes:
>>>>>> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32
>>>>>> 1, i64
>>>>>> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>>>>>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align
>>>>>> 32,
>>>>>> offset 0] [from ]
>>>>>> !6 = metadata !{metadata !7, metadata !9}
>>>>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a",
>>>>>> i32
>>>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>>>>>> [line
>>>>>> 2, size 32, align 32, offset 0] [from int]
>>>>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata
>>>>>> !"Base",
>>>>>> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1
>>>>>> false, i32
>>>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13,
>>>>>> i32 1} ;
>>>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>>>> 
>>>>>> Note that the cycles are gone and !4 has a hash value of 915398439,
>>>>>> and the
>>>>>> references to !4 are replaced with 915398439.
>>>>>> Thanks Eric for suggesting replacing MD reference with a hash value.
>>>>>> 
>>>>>> 
>>>>>> In particular I recommended this:
>>>>>> 
>>>>>> a) For C++ odr replace it with the "hash" that's just a string
>>>>>> representing the type name.
>>>>>> b) For Internal C++ types and all C types replace it with a string
>>>>>> that's a concatenation of the type name and the name of the compile
>>>>>> unit.
>>>>>> 
>>>>>> Yes, that is what we agreed on over email.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> There are a few issues:
>>>>>> 1> How to generate the hash for a given type?
>>>>>> With C++'s ODR, it should be enough by using the context and the name
>>>>>> for
>>>>>> non-internal c++ types.
>>>>>> For internal c++ types and types of other languages, hash will not be
>>>>>> used.
>>>>>> 
>>>>>> 
>>>>>> Explain this?
>>>>>> 
>>>>>> 
>>>>>> For a while, I am going to support both hash and MD reference, once
>>>>>> everything is working with hash,
>>>>>> I will update all debug info testing cases, turn -gtype-uniquing on,
>>>>>> and
>>>>>> remove the other path.
>>>>>> 
>>>>>> For internal c++ types, initially they will follow the path of using
>>>>>> MD
>>>>>> references without a hash.
>>>>>> 
>>>>>> 
>>>>>> My current implementation is to add a few static member functions in
>>>>>> MDNode
>>>>>> to profile DI nodes differently.
>>>>>> + /// If the array of Vals is for debug info, profile it specially and
>>>>>> return true.
>>>>>> + /// If the DI node has a hash value, generate the profile using only
>>>>>> the
>>>>>> hash value and the declaration flag.
>>>>>> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals,
>>>>>> FoldingSetNodeID
>>>>>> &ID);
>>>>>> 
>>>>>> + /// If the MDNode is for debug info, profile it specially and return
>>>>>> true.
>>>>>> + /// If the DI node has a hash value, generate the profile using only
>>>>>> the
>>>>>> hash value and the declaration flag.
>>>>>> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID
>>>>>> &ID);
>>>>>> 
>>>>>> + /// Given a hash value and a flag, generate the profile for later
>>>>>> lookup.
>>>>>> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
>>>>>> FoldingSetNodeID &ID);
>>>>>> 
>>>>>> These static functions are called in Metadata.cpp:
>>>>>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>>>>>> +  if (profileDebugInfoNode(this, ID))
>>>>>> +    return;
>>>>>> +
>>>>>> 
>>>>>> There are other examples of these in MDNode for handling of specific
>>>>>> metadata.
>>>>>> /// Methods for metadata merging.
>>>>>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>>>>>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>>>>>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>>>>>> 
>>>>>> Comments are welcome on whether this violates any layering rule.
>>>>>> 
>>>>>> 
>>>>>> As I've said many times in email, I don't think this is a good idea
>>>>>> and would prefer either a or b below. a) is a much simpler solution.
>>>>>> 
>>>>>> Any reason that why it is not a good idea?
>>>>>> 
>>>>>> 
>>>>>> Other choices are:
>>>>>> a> Keep a map in DwarfDebug
>>>>>> Keep in mind that the map is used at many stages, and it has to be in
>>>>>> sync
>>>>>> with MDNodeSet.
>>>>>> b> Generalize MDNode to be aware of hash (David can provide more
>>>>>> details)
>>>>>> c> Extend MDNode to DINode and modify streamers (bitcode
>>>>>> reader|writer, ll
>>>>>> reader|writer) to be aware of DINode
>>>>>> We can provide DINode::get(…) to create a DINode. DINode can have its
>>>>>> own
>>>>>> Profile function.
>>>>>> Other suggestions are welcome.
>>>>>> 
>>>>>> 
>>>>>> a or b please.
>>>>>> 
>>>>>> Option a> will require a DwarfDebug pointer in every stage of the
>>>>>> compiler,
>>>>>> and passing the map to the DI classes.
>>>>>> A rough estimation is around 100 places.
>>>>>> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm
>>>>>> linker?
>>>>>> Also the map needs to be in sync with MDNodeSet, maybe using
>>>>>> ValueHandle can
>>>>>> solve the problem.
>>>>>> 
>>>>>> 
>>>>>> What about putting the map in LLVMContextImpl?
>>>>>> It already has a few things specifically for debug info:
>>>>>> std::vector<DebugRecVH> ScopeRecords;
>>>>>> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
>>>>>> …
>>>>>> 
>>>>>> I remember David mentioned it once and I forgot about the conclusion.
>>>>>> 
>>>>>> 
>>>>>> I mentioned it only as speculation as to how you were implementing it
>>>>>> already (but you were doing the profile-changing stuff).
>>>>>> 
>>>>>> I don't think it should be necessary to have the map (in option (a))
>>>>>> in such a central location as LLVMContext. It should be usable just
>>>>>> from DwarfDebug for generation, and DIBuilder can have its own,
>>>>>> separate map to do similar things during DI building.
>>>>>> 
>>>>>> 
>>>>>> We also need the map during llvm linking since linking will create new
>>>>>> MDNodes.
>>>>>> 
>>>>>> 
>>>>>> I don't understand what you mean by this. IR linking shouldn't need to
>>>>>> do anything debug-info-specific, it should just be the normal IR
>>>>>> approach.
>>>>>> 
>>>>>> The declaration-v-definition resolution can be done at codegen-time.
>>>>>> We'll have to walk all the lists of retained types anyway to build the
>>>>>> map, so we can do declaration-v-definition (keep definitions over
>>>>>> declarations when we see both) at that point.
>>>>>> 
>>>>>> 
>>>>>> Yes, you are right that at codegen-time, we can generate the map from
>>>>>> the
>>>>>> lists
>>>>>> of retained types.
>>>>>> 
>>>>>> But dumping the linked ll file requires the map when outputting
>>>>>> comments of
>>>>>> the MDNode :]
>>>>> 
>>>>> Depending on which things we print out, but yes, in some cases
>>>>> (derived types) we do print out the type referenced. I assume the
>>>>> AsmPrinter can build such a map, then. (in fact, with a few clients
>>>>> like this, it might be nice to build a bit of an abstraction around it
>>>>> rather than just using a raw map - something that has a ctor (or I
>>>>> suppose it could be a factory function) that reads in the right
>>>>> metadata, walks the compile units, etc, and builds the mapping)
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Manman
>>>>>> 
>>>>>> 
>>>>>> So any other opinion on putting it in LLVMContext other than it being
>>>>>> central?
>>>>>> 
>>>>>> Thanks,
>>>>>> Manman
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Manman
>>>>>> 
>>>>>> 
>>>>>> More details for option b from David
>>>>>> 
>>>>>> < The alternative I have in mind is a more complete version of what
>>>>>> < you're proposing - a full MD feature, not an MD feature that's just
>>>>>> < barely enough to support the needs of debug info. What we could do
>>>>>> is
>>>>>> < allow the insertion of these MDHash things you spoke about but take
>>>>>> it
>>>>>> < a step further and have MDNode::getOperand walk through the hash &
>>>>>> < give the value (in this way, DebugInfo wouldn't have to change at
>>>>>> all
>>>>>> < to handle hashes - if the Metadata APIs are going to be aware of the
>>>>>> < hashes anyway, they might as well provide this convenience
>>>>>> < functionality) the metadata feature would also have to have some
>>>>>> < blessed top-level named metadata that would have a list of
>>>>>> hash+MDNode
>>>>>> < to keep those MDNodes alive (so you wouldn't have to stuff all the
>>>>>> < types in the retained types list - metadata would provide the full
>>>>>> < support, not just half of it).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Transition from current DI Metadata:
>>>>>> To have a smooth transition, I will add a flag "-gtype-hashing" for
>>>>>> the type
>>>>>> uniquing work and turn it on by default when we are ready.
>>>>>> 
>>>>>> 
>>>>>> I'd prefer just make the change to have the front end emit the "hash"
>>>>>> (it's not really a hash, it's just a string).
>>>>>> 
>>>>>> Are you saying no transition period? A single patch to have correct
>>>>>> handling
>>>>>> of "hash" and to update all existing testing cases?
>>>>>> 
>>>>>> 
>>>>>> -----------------------------
>>>>>> Patches:
>>>>>> Expect the following patches:
>>>>>> 1> add flag -gtype-hashing
>>>>>> 2> add hash field to DI types
>>>>>> 3> modify DIBuilder to use hash instead of MD reference
>>>>>> 4> related to issue 3
>>>>>> 
>>>>>> 
>>>>>> These can all be a single patch since it shouldn't be very large if we
>>>>>> go with a) above. If we go with b) then the MDNode work should be done
>>>>>> in isolation first and then the debug info on top of it.
>>>>>> 
>>>>>> What is wrong with smaller patches?
>>>>>> My estimation for all the above with a) is about 30K + testing cases.
>>>>>> 
>>>>>> 
>>>>>> 5> backend change (in DwarfDebug|CompileUnit) to support types shared
>>>>>> among
>>>>>> compile units
>>>>>> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to
>>>>>> ref_addr
>>>>>> 
>>>>>> 
>>>>>> #5 can and should be done before the rest of them.
>>>>>> 
>>>>>> I prefer to submit patches according to the flow of the compiler,
>>>>>> starting
>>>>>> from the frontend, then IR, then backend.
>>>>>> The testing cases will be added for front end, llvm-link and backend.
>>>>>> Any reason why #5 should be done first?
>>>>>> 
>>>>>> 
>>>>>> All changes should be local to debug info classes except patch #4.
>>>>>> 
>>>>>> 
>>>>>> What's patch #4?
>>>>>> 
>>>>>> Patch #4 above: related to issue 3 (changes corresponding to how to
>>>>>> solve
>>>>>> issue #3)
>>>>>> 
>>>>>> -Manman
>>>>>> 
>>>>>> 
>>>>>> -eric
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>> 
>>> 
>> 
>> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130621/0eecc845/attachment.html>