[LLVMdev] Proposal: type uniquing of debug info for LTO

Thu Jun 20 18:48:10 PDT 2013

On Jun 20, 2013, at 5:39 PM, David Blaikie wrote:

> On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
>> 
>> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> 
>> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>> 
>> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> 
>> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> 
>> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>> 
>> Hi Manman,
>> 
>> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>> 
>> 
>> The intent of this proposal is to speedup compilation of "-flto -g" for c++
>> programs.
>> This is based on discussions with Adrian, David and Eric.
>> 
>> 
>> Thanks for bringing this back to the list. The original thread was
>> getting quite long.
>> 
>> ---------------------------
>> Problem:
>> A single class can be used in multiple source files and the DI (Debug Info)
>> class is included in multiple bc files. The duplication of
>> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to
>> large memory requirement.
>> 
>> As an example, SPEC xalancbmk requires 7GB of memory when compiled with
>> -flto -g.
>> With a preliminary implementation of type uniquing, the memory usage will be
>> down to 2GB.
>> 
>> In order to unique types, we have to break cycles in the MDNodes.
>> 
>> A simple struct definition
>> struct Base {
>> int a;
>> };
>> can cause cycles in MDNodes:
>> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32 1,
>> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [
>> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
>> !14 = metadata !{metadata !15, metadata !16}
>> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata !"a", i32
>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
>> 2, size 32, align 32, offset 0] [from int]
>> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata !"Base",
>> metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1 false, i32
>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32 1} ;
>> [ DW_TAG_subprogram ] [line 1] [Base]
>> 
>> Cycles: !12 -- !14 -- !15 -- !12
>>        !12 -- !14 -- !16 -- !12
>> 
>> These cycles make it hard to unique the same struct used in two bc files.
>> 
>> ---------------------------
>> How to fix:
>> 
>> We attach a hash value to types to help type uniquing and we also replace
>> references to types with their hash values.
>> For the above struct "Base", we now have the following MDNodes:
>> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1, i64
>> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
>> offset 0] [from ]
>> !6 = metadata !{metadata !7, metadata !9}
>> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a", i32
>> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
>> 2, size 32, align 32, offset 0] [from int]
>> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base",
>> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32
>> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ;
>> [ DW_TAG_subprogram ] [line 1] [Base]
>> 
>> Note that the cycles are gone and !4 has a hash value of 915398439, and the
>> references to !4 are replaced with 915398439.
>> Thanks Eric for suggesting replacing MD reference with a hash value.
>> 
>> 
>> In particular I recommended this:
>> 
>> a) For C++ odr replace it with the "hash" that's just a string
>> representing the type name.
>> b) For Internal C++ types and all C types replace it with a string
>> that's a concatenation of the type name and the name of the compile
>> unit.
>> 
>> Yes, that is what we agreed on over email.
>> 
>> 
>> 
>> There are a few issues:
>> 1> How to generate the hash for a given type?
>> With C++'s ODR, it should be enough by using the context and the name for
>> non-internal c++ types.
>> For internal c++ types and types of other languages, hash will not be used.
>> 
>> 
>> Explain this?
>> 
>> 
>> For a while, I am going to support both hash and MD reference, once
>> everything is working with hash,
>> I will update all debug info testing cases, turn -gtype-uniquing on, and
>> remove the other path.
>> 
>> For internal c++ types, initially they will follow the path of using MD
>> references without a hash.
>> 
>> 
>> My current implementation is to add a few static member functions in MDNode
>> to profile DI nodes differently.
>> + /// If the array of Vals is for debug info, profile it specially and
>> return true.
>> + /// If the DI node has a hash value, generate the profile using only the
>> hash value and the declaration flag.
>> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals, FoldingSetNodeID
>> &ID);
>> 
>> + /// If the MDNode is for debug info, profile it specially and return true.
>> + /// If the DI node has a hash value, generate the profile using only the
>> hash value and the declaration flag.
>> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID &ID);
>> 
>> + /// Given a hash value and a flag, generate the profile for later lookup.
>> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
>> FoldingSetNodeID &ID);
>> 
>> These static functions are called in Metadata.cpp:
>> void MDNode::Profile(FoldingSetNodeID &ID) const {
>> +  if (profileDebugInfoNode(this, ID))
>> +    return;
>> +
>> 
>> There are other examples of these in MDNode for handling of specific
>> metadata.
>> /// Methods for metadata merging.
>> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>> 
>> Comments are welcome on whether this violates any layering rule.
>> 
>> 
>> As I've said many times in email, I don't think this is a good idea
>> and would prefer either a or b below. a) is a much simpler solution.
>> 
>> Any reason that why it is not a good idea?
>> 
>> 
>> Other choices are:
>> a> Keep a map in DwarfDebug
>> Keep in mind that the map is used at many stages, and it has to be in sync
>> with MDNodeSet.
>> b> Generalize MDNode to be aware of hash (David can provide more details)
>> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer, ll
>> reader|writer) to be aware of DINode
>> We can provide DINode::get(…) to create a DINode. DINode can have its own
>> Profile function.
>> Other suggestions are welcome.
>> 
>> 
>> a or b please.
>> 
>> Option a> will require a DwarfDebug pointer in every stage of the compiler,
>> and passing the map to the DI classes.
>> A rough estimation is around 100 places.
>> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm linker?
>> Also the map needs to be in sync with MDNodeSet, maybe using ValueHandle can
>> solve the problem.
>> 
>> 
>> What about putting the map in LLVMContextImpl?
>> It already has a few things specifically for debug info:
>> std::vector<DebugRecVH> ScopeRecords;
>> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
>> …
>> 
>> I remember David mentioned it once and I forgot about the conclusion.
>> 
>> 
>> I mentioned it only as speculation as to how you were implementing it
>> already (but you were doing the profile-changing stuff).
>> 
>> I don't think it should be necessary to have the map (in option (a))
>> in such a central location as LLVMContext. It should be usable just
>> from DwarfDebug for generation, and DIBuilder can have its own,
>> separate map to do similar things during DI building.
>> 
>> 
>> We also need the map during llvm linking since linking will create new
>> MDNodes.
>> 
>> 
>> I don't understand what you mean by this. IR linking shouldn't need to
>> do anything debug-info-specific, it should just be the normal IR
>> approach.
>> 
>> The declaration-v-definition resolution can be done at codegen-time.
>> We'll have to walk all the lists of retained types anyway to build the
>> map, so we can do declaration-v-definition (keep definitions over
>> declarations when we see both) at that point.
>> 
>> 
>> Yes, you are right that at codegen-time, we can generate the map from the
>> lists
>> of retained types.
>> 
>> But dumping the linked ll file requires the map when outputting comments of
>> the MDNode :]
> 
> Depending on which things we print out, but yes, in some cases
> (derived types) we do print out the type referenced. I assume the
> AsmPrinter can build such a map, then. (in fact, with a few clients
> like this, it might be nice to build a bit of an abstraction around it
> rather than just using a raw map - something that has a ctor (or I
> suppose it could be a factory function) that reads in the right
> metadata, walks the compile units, etc, and builds the mapping)

A call to MDNode::dump() will require the map to print out the comments.
To continue supporting dump(), we need to have the map at almost all times.

Thanks,
Manman

> 
>> 
>> Thanks,
>> Manman
>> 
>> 
>> So any other opinion on putting it in LLVMContext other than it being
>> central?
>> 
>> Thanks,
>> Manman
>> 
>> 
>> 
>> Thanks,
>> Manman
>> 
>> 
>> More details for option b from David
>> 
>> < The alternative I have in mind is a more complete version of what
>> < you're proposing - a full MD feature, not an MD feature that's just
>> < barely enough to support the needs of debug info. What we could do is
>> < allow the insertion of these MDHash things you spoke about but take it
>> < a step further and have MDNode::getOperand walk through the hash &
>> < give the value (in this way, DebugInfo wouldn't have to change at all
>> < to handle hashes - if the Metadata APIs are going to be aware of the
>> < hashes anyway, they might as well provide this convenience
>> < functionality) the metadata feature would also have to have some
>> < blessed top-level named metadata that would have a list of hash+MDNode
>> < to keep those MDNodes alive (so you wouldn't have to stuff all the
>> < types in the retained types list - metadata would provide the full
>> < support, not just half of it).
>> 
>> 
>> 
>> Transition from current DI Metadata:
>> To have a smooth transition, I will add a flag "-gtype-hashing" for the type
>> uniquing work and turn it on by default when we are ready.
>> 
>> 
>> I'd prefer just make the change to have the front end emit the "hash"
>> (it's not really a hash, it's just a string).
>> 
>> Are you saying no transition period? A single patch to have correct handling
>> of "hash" and to update all existing testing cases?
>> 
>> 
>> -----------------------------
>> Patches:
>> Expect the following patches:
>> 1> add flag -gtype-hashing
>> 2> add hash field to DI types
>> 3> modify DIBuilder to use hash instead of MD reference
>> 4> related to issue 3
>> 
>> 
>> These can all be a single patch since it shouldn't be very large if we
>> go with a) above. If we go with b) then the MDNode work should be done
>> in isolation first and then the debug info on top of it.
>> 
>> What is wrong with smaller patches?
>> My estimation for all the above with a) is about 30K + testing cases.
>> 
>> 
>> 5> backend change (in DwarfDebug|CompileUnit) to support types shared among
>> compile units
>> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to ref_addr
>> 
>> 
>> #5 can and should be done before the rest of them.
>> 
>> I prefer to submit patches according to the flow of the compiler, starting
>> from the frontend, then IR, then backend.
>> The testing cases will be added for front end, llvm-link and backend.
>> Any reason why #5 should be done first?
>> 
>> 
>> All changes should be local to debug info classes except patch #4.
>> 
>> 
>> What's patch #4?
>> 
>> Patch #4 above: related to issue 3 (changes corresponding to how to solve
>> issue #3)
>> 
>> -Manman
>> 
>> 
>> -eric
>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
>> 
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
>>