[LLVMdev] Proposal: type uniquing of debug info for LTO

Fri Jun 21 15:16:22 PDT 2013

On Jun 21, 2013, at 1:12 PM, David Blaikie <dblaikie at gmail.com> wrote:

> On Fri, Jun 21, 2013 at 12:33 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> On Jun 21, 2013, at 11:56 AM, Eric Christopher wrote:
>> 
>>> On Fri, Jun 21, 2013 at 11:50 AM, Manman Ren <mren at apple.com> wrote:
>>>> 
>>>> On Jun 21, 2013, at 11:35 AM, Eric Christopher wrote:
>>>> 
>>>>> On Thu, Jun 20, 2013 at 10:52 PM, Manman Ren <mren at apple.com> wrote:
>>>>>> 
>>>>>> A summary of options for issue #3:
>>>>>> 3> To actually access the MDNode referenced via the hash value, we need to perform a lookup from the hash value to find the corresponding MDNode.
>>>>>> The questions are where to store this map and how to keep it up-to-date when a MDNode is replaced.
>>>>>> ---------------------
>>>>>> Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access functions in DI classes to pass in the map.
>>>>>> May need to modify printing functions in AsmWriter.
>>>>> 
>>>>> You may not need the map in DIBuilder if you piggy back on the
>>>>> existing map in CGDebugInfo.cpp. The AsmWriter stuff is unfortunate
>>>>> for just debug dumps of the IR. A way around this might be nice, but
>>>>> I've not looked into it.
>> 
>> Calling dump() on a MDNode inside gdb will not give us the derived type information, since we don't have a handle to the map.
> 
> The only case where dump currently navigates a type link that I know
> is printing derived types, where we print "[from foo]". By just
> printing the string identifier for the type (without having to
> navigate the link/use the map) should be about as useful here.
> 
>> 
>>>>> 
>>>>>> ---------------------
>>>>>> Option b) I am going to expand David's description with more details, correct me if I am wrong.
>>>>> 
>>>>> I'll let David continue this part of the thread with you.
>>>>> 
>>>>>> 
>>>>>> I personally prefer option b over a since it is much cleaner.
>>>>>> 
>>>>> 
>>>>> Option b is more general and should probably work. I suggested a
>>>>> originally because I know it would work and the impact is limited to
>>>>> debug info. Also it doesn't involve touching the generic metadata
>>>>> interface or code and could be done quickly and incrementally. b may
>>>>> also require further changes to the debug info infrastructure to deal
>>>>> with lookup, etc so may end up being as much churn as a. b is more
>>>>> useful if we think other users of metadata may want/need similar
>>>>> functionality in the future.
>>>>> 
>>>>> I'm fine with b as a choice, but it's going to involve more planning
>>>>> and code review and involve people outside of you, me and Dave :)
>>>> 
>>>> Got it. David, are you okay with the details I added about option b?
>>>> Let's start a new thread with title "Proposal: extend Metadata (MDNode) to support MDHash" or something like that.
>>> 
>>> I'd like to hear a rationale for wanting b first. I don't know (and
>>> can think of) any other users of this functionality right now.
>> 
>> I will try my best :]
>> MDNode has a list of operands, one MDNode's full content can be thought of as a graph, when there is a reference to another MDNode, we have
>> an edge. The MDHash can be used to give a unique id (or string) to the full content of the graph.
>> 
>> The hash value applies to any kind of MDNode with complicated MDNode references.
>> 
>> If you think this is not general enough to be in MDNode,
> 
> Eric's concern/question is whether it's a feature that would be useful
> to any other code, I believe. (I'm less concerned about this so long
> as the feature is appropriately general, even if it doesn't have any
> other consumers at the moment - I don't think it'd drastically
> complicate the Metadata handling code (which I think is currently
> fairly simple, but I admit to not knowing it in any great detail))

Yes, I tried to explain why MDHash can help with MDNodes that form a complicated graph.

> 
>> my earlier proposal of extending MDNode to DINode is going the other direction.
> 
> As discussed, this isn't far enough in the other direction. If we're
> going to touch core LLVM Metadata handling code at all, it needs to be
> a complete/cohesive feature, not just tendrils of a debug info
> feature. Hence (a) or (b) not (c).
> 
>> Basically, MDNode will have an opcode like the opcode for Instruction, to support TBAA, DI, and other types.
>> Inside streamers (bc ll reader), we can create the specialized class according to the opcode.
>> That requires a lot of changes too :(
>> 
>>> 
>>>> We can add a weight to MDHash (other names are fine with me), to support multiple MDNodes with the same hash, but different weight.
>>>> 
>>> 
>>> Wait, what? This sounds like a bad idea.
>> Why?
>> The weight is needed to differentiate a forward declaration vs. a definition, they both name the same class, but they are different.
> 
> This would probably tip me over into Eric's perspective that it would
> be too esoteric a feature to add to a generic/core piece of
> infrastructure.

If we allow the map to map one hash value (or string) to multiple MDNodes, without adding the weight, it should still work.
But is that general enough?

Thanks,
Manman

> 
> Yes, I'm not sure we should resolve duplicates in this case, then
> (short of assigning a separate identifier to declarations V
> definitions so they don't collide/get deduped - then doing a pass in
> DwarfDebug/during debug-info-generation time to resolve in favor of
> definitions (hmm, yeah, even that doesn't quite make sense - you'd
> have to be able to visit the things referring to the declarations &
> update them to refer to definitions, and we'd have no real way to
> discover those easily except walking all the debug info... could
> work))
> 
>> 
>> Thanks,
>> Manman
>> 
>>> 
>>> -eric
>>> 
>>>> Thanks,
>>>> Manman
>>>> 
>>>>> 
>>>>> -eric
>>>>> 
>>>>> 
>>>>>> Thanks,
>>>>>> Manman
>>>>>> 
>>>>>> On Jun 20, 2013, at 5:39 PM, David Blaikie wrote:
>>>>>> 
>>>>>>> On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>>>> 
>>>>>>>> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>>>>>>>> 
>>>>>>>> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>>>>>>>> 
>>>>>>>> Hi Manman,
>>>>>>>> 
>>>>>>>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The intent of this proposal is to speedup compilation of "-flto -g" for c++
>>>>>>>> programs.
>>>>>>>> This is based on discussions with Adrian, David and Eric.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for bringing this back to the list. The original thread was
>>>>>>>> getting quite long.
>>>>>>>> 
>>>>>>>> ---------------------------
>>>>>>>> Problem:
>>>>>>>> A single class can be used in multiple source files and the DI (Debug Info)
>>>>>>>> class is included in multiple bc files. The duplication of
>>>>>>>> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to
>>>>>>>> large memory requirement.
>>>>>>>> 
>>>>>>>> As an example, SPEC xalancbmk requires 7GB of memory when compiled with
>>>>>>>> -flto -g.
>>>>>>>> With a preliminary implementation of type uniquing, the memory usage will be
>>>>>>>> down to 2GB.
>>>>>>>> 
>>>>>>>> In order to unique types, we have to break cycles in the MDNodes.
>>>>>>>> 
>>>>>>>> A simple struct definition
>>>>>>>> struct Base {
>>>>>>>> int a;
>>>>>>>> };
>>>>>>>> can cause cycles in MDNodes:
>>>>>>>> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32 1,
>>>>>>>> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [
>>>>>>>> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
>>>>>>>> !14 = metadata !{metadata !15, metadata !16}
>>>>>>>> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata !"a", i32
>>>>>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
>>>>>>>> 2, size 32, align 32, offset 0] [from int]
>>>>>>>> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata !"Base",
>>>>>>>> metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1 false, i32
>>>>>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32 1} ;
>>>>>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>>>>>> 
>>>>>>>> Cycles: !12 -- !14 -- !15 -- !12
>>>>>>>>     !12 -- !14 -- !16 -- !12
>>>>>>>> 
>>>>>>>> These cycles make it hard to unique the same struct used in two bc files.
>>>>>>>> 
>>>>>>>> ---------------------------
>>>>>>>> How to fix:
>>>>>>>> 
>>>>>>>> We attach a hash value to types to help type uniquing and we also replace
>>>>>>>> references to types with their hash values.
>>>>>>>> For the above struct "Base", we now have the following MDNodes:
>>>>>>>> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1, i64
>>>>>>>> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>>>>>>>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
>>>>>>>> offset 0] [from ]
>>>>>>>> !6 = metadata !{metadata !7, metadata !9}
>>>>>>>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a", i32
>>>>>>>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
>>>>>>>> 2, size 32, align 32, offset 0] [from int]
>>>>>>>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base",
>>>>>>>> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32
>>>>>>>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ;
>>>>>>>> [ DW_TAG_subprogram ] [line 1] [Base]
>>>>>>>> 
>>>>>>>> Note that the cycles are gone and !4 has a hash value of 915398439, and the
>>>>>>>> references to !4 are replaced with 915398439.
>>>>>>>> Thanks Eric for suggesting replacing MD reference with a hash value.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> In particular I recommended this:
>>>>>>>> 
>>>>>>>> a) For C++ odr replace it with the "hash" that's just a string
>>>>>>>> representing the type name.
>>>>>>>> b) For Internal C++ types and all C types replace it with a string
>>>>>>>> that's a concatenation of the type name and the name of the compile
>>>>>>>> unit.
>>>>>>>> 
>>>>>>>> Yes, that is what we agreed on over email.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> There are a few issues:
>>>>>>>> 1> How to generate the hash for a given type?
>>>>>>>> With C++'s ODR, it should be enough by using the context and the name for
>>>>>>>> non-internal c++ types.
>>>>>>>> For internal c++ types and types of other languages, hash will not be used.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Explain this?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> For a while, I am going to support both hash and MD reference, once
>>>>>>>> everything is working with hash,
>>>>>>>> I will update all debug info testing cases, turn -gtype-uniquing on, and
>>>>>>>> remove the other path.
>>>>>>>> 
>>>>>>>> For internal c++ types, initially they will follow the path of using MD
>>>>>>>> references without a hash.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> My current implementation is to add a few static member functions in MDNode
>>>>>>>> to profile DI nodes differently.
>>>>>>>> + /// If the array of Vals is for debug info, profile it specially and
>>>>>>>> return true.
>>>>>>>> + /// If the DI node has a hash value, generate the profile using only the
>>>>>>>> hash value and the declaration flag.
>>>>>>>> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals, FoldingSetNodeID
>>>>>>>> &ID);
>>>>>>>> 
>>>>>>>> + /// If the MDNode is for debug info, profile it specially and return true.
>>>>>>>> + /// If the DI node has a hash value, generate the profile using only the
>>>>>>>> hash value and the declaration flag.
>>>>>>>> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID &ID);
>>>>>>>> 
>>>>>>>> + /// Given a hash value and a flag, generate the profile for later lookup.
>>>>>>>> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
>>>>>>>> FoldingSetNodeID &ID);
>>>>>>>> 
>>>>>>>> These static functions are called in Metadata.cpp:
>>>>>>>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>>>>>>>> +  if (profileDebugInfoNode(this, ID))
>>>>>>>> +    return;
>>>>>>>> +
>>>>>>>> 
>>>>>>>> There are other examples of these in MDNode for handling of specific
>>>>>>>> metadata.
>>>>>>>> /// Methods for metadata merging.
>>>>>>>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>>>>>>>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>>>>>>>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>>>>>>>> 
>>>>>>>> Comments are welcome on whether this violates any layering rule.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> As I've said many times in email, I don't think this is a good idea
>>>>>>>> and would prefer either a or b below. a) is a much simpler solution.
>>>>>>>> 
>>>>>>>> Any reason that why it is not a good idea?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Other choices are:
>>>>>>>> a> Keep a map in DwarfDebug
>>>>>>>> Keep in mind that the map is used at many stages, and it has to be in sync
>>>>>>>> with MDNodeSet.
>>>>>>>> b> Generalize MDNode to be aware of hash (David can provide more details)
>>>>>>>> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer, ll
>>>>>>>> reader|writer) to be aware of DINode
>>>>>>>> We can provide DINode::get(…) to create a DINode. DINode can have its own
>>>>>>>> Profile function.
>>>>>>>> Other suggestions are welcome.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> a or b please.
>>>>>>>> 
>>>>>>>> Option a> will require a DwarfDebug pointer in every stage of the compiler,
>>>>>>>> and passing the map to the DI classes.
>>>>>>>> A rough estimation is around 100 places.
>>>>>>>> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm linker?
>>>>>>>> Also the map needs to be in sync with MDNodeSet, maybe using ValueHandle can
>>>>>>>> solve the problem.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> What about putting the map in LLVMContextImpl?
>>>>>>>> It already has a few things specifically for debug info:
>>>>>>>> std::vector<DebugRecVH> ScopeRecords;
>>>>>>>> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
>>>>>>>> …
>>>>>>>> 
>>>>>>>> I remember David mentioned it once and I forgot about the conclusion.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I mentioned it only as speculation as to how you were implementing it
>>>>>>>> already (but you were doing the profile-changing stuff).
>>>>>>>> 
>>>>>>>> I don't think it should be necessary to have the map (in option (a))
>>>>>>>> in such a central location as LLVMContext. It should be usable just
>>>>>>>> from DwarfDebug for generation, and DIBuilder can have its own,
>>>>>>>> separate map to do similar things during DI building.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> We also need the map during llvm linking since linking will create new
>>>>>>>> MDNodes.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't understand what you mean by this. IR linking shouldn't need to
>>>>>>>> do anything debug-info-specific, it should just be the normal IR
>>>>>>>> approach.
>>>>>>>> 
>>>>>>>> The declaration-v-definition resolution can be done at codegen-time.
>>>>>>>> We'll have to walk all the lists of retained types anyway to build the
>>>>>>>> map, so we can do declaration-v-definition (keep definitions over
>>>>>>>> declarations when we see both) at that point.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yes, you are right that at codegen-time, we can generate the map from the
>>>>>>>> lists
>>>>>>>> of retained types.
>>>>>>>> 
>>>>>>>> But dumping the linked ll file requires the map when outputting comments of
>>>>>>>> the MDNode :]
>>>>>>> 
>>>>>>> Depending on which things we print out, but yes, in some cases
>>>>>>> (derived types) we do print out the type referenced. I assume the
>>>>>>> AsmPrinter can build such a map, then. (in fact, with a few clients
>>>>>>> like this, it might be nice to build a bit of an abstraction around it
>>>>>>> rather than just using a raw map - something that has a ctor (or I
>>>>>>> suppose it could be a factory function) that reads in the right
>>>>>>> metadata, walks the compile units, etc, and builds the mapping)
>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Manman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> So any other opinion on putting it in LLVMContext other than it being
>>>>>>>> central?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Manman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Manman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> More details for option b from David
>>>>>>>> 
>>>>>>>> < The alternative I have in mind is a more complete version of what
>>>>>>>> < you're proposing - a full MD feature, not an MD feature that's just
>>>>>>>> < barely enough to support the needs of debug info. What we could do is
>>>>>>>> < allow the insertion of these MDHash things you spoke about but take it
>>>>>>>> < a step further and have MDNode::getOperand walk through the hash &
>>>>>>>> < give the value (in this way, DebugInfo wouldn't have to change at all
>>>>>>>> < to handle hashes - if the Metadata APIs are going to be aware of the
>>>>>>>> < hashes anyway, they might as well provide this convenience
>>>>>>>> < functionality) the metadata feature would also have to have some
>>>>>>>> < blessed top-level named metadata that would have a list of hash+MDNode
>>>>>>>> < to keep those MDNodes alive (so you wouldn't have to stuff all the
>>>>>>>> < types in the retained types list - metadata would provide the full
>>>>>>>> < support, not just half of it).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Transition from current DI Metadata:
>>>>>>>> To have a smooth transition, I will add a flag "-gtype-hashing" for the type
>>>>>>>> uniquing work and turn it on by default when we are ready.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'd prefer just make the change to have the front end emit the "hash"
>>>>>>>> (it's not really a hash, it's just a string).
>>>>>>>> 
>>>>>>>> Are you saying no transition period? A single patch to have correct handling
>>>>>>>> of "hash" and to update all existing testing cases?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----------------------------
>>>>>>>> Patches:
>>>>>>>> Expect the following patches:
>>>>>>>> 1> add flag -gtype-hashing
>>>>>>>> 2> add hash field to DI types
>>>>>>>> 3> modify DIBuilder to use hash instead of MD reference
>>>>>>>> 4> related to issue 3
>>>>>>>> 
>>>>>>>> 
>>>>>>>> These can all be a single patch since it shouldn't be very large if we
>>>>>>>> go with a) above. If we go with b) then the MDNode work should be done
>>>>>>>> in isolation first and then the debug info on top of it.
>>>>>>>> 
>>>>>>>> What is wrong with smaller patches?
>>>>>>>> My estimation for all the above with a) is about 30K + testing cases.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 5> backend change (in DwarfDebug|CompileUnit) to support types shared among
>>>>>>>> compile units
>>>>>>>> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to ref_addr
>>>>>>>> 
>>>>>>>> 
>>>>>>>> #5 can and should be done before the rest of them.
>>>>>>>> 
>>>>>>>> I prefer to submit patches according to the flow of the compiler, starting
>>>>>>>> from the frontend, then IR, then backend.
>>>>>>>> The testing cases will be added for front end, llvm-link and backend.
>>>>>>>> Any reason why #5 should be done first?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> All changes should be local to debug info classes except patch #4.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> What's patch #4?
>>>>>>>> 
>>>>>>>> Patch #4 above: related to issue 3 (changes corresponding to how to solve
>>>>>>>> issue #3)
>>>>>>>> 
>>>>>>>> -Manman
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -eric
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> llvm-commits mailing list
>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130621/31726c65/attachment.html>