[LLVMdev] Proposal: type uniquing of debug info for LTO

Fri Jun 21 01:19:16 PDT 2013

Wouldn't it be simpler to name the metadata based on the hash of the
content? Then you could use a normal reference to that metadata without
needing to create a new type or teach the rest of llvm how to use it...

On Fri, Jun 21, 2013 at 3:22 PM, Manman Ren <mren at apple.com> wrote:

>
> A summary of options for issue #3:
> 3> To actually access the MDNode referenced via the hash value, we need to
> perform a lookup from the hash value to find the corresponding MDNode.
> The questions are where to store this map and how to keep it up-to-date
> when a MDNode is replaced.
> ---------------------
> Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
> functions in DI classes to pass in the map.
> May need to modify printing functions in AsmWriter.
> ---------------------
> Option b) I am going to expand David's description with more details,
> correct me if I am wrong.
>    A special Value called MDHash in parallel with MDString, MDHash needs
> to specify whether it is a use of the hash value, or a definition of the
> hash value
>    The updated class hierarchy:
>    Value -- MDNode
>                -- MDHash
>                -- MDString
>                -- User -- Instruction -- BinaryOperator
>
>    In the following example:
> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1,
> i64 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
> offset 0] [from ]
> !6 = metadata !{metadata !7, metadata !9}
> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a", i32
> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line
> 2, size 32, align 32, offset 0] [from int]
> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base",
> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false,
> i32 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32
> 1} ; [ DW_TAG_subprogram ] [line 1] [Base]
>    915398439 in !4 is the definition of a hash value, 915398439 in !7 or
> !9 is the use of a hash value.
>
> A separate map "HashToNodeMap" is added to LLVMContext.
> MDNode will handle MDHash correctly in the sense that it will update
> HashToNodeMap whenever it creates a MDNode with a MDHash operand.
> MDNode::getOperand will perform a lookup if the actual operand is MDHash
> and return the corresponding MDNode.
>
> I am not sure whether a named metadata to describe the map is necessary,
> assuming we already have the retained type list for each CU.
> Also at IR linking, the named metadata from all CUs will be combined,
> leading to a long list.
>
> ---------------------
> David's description is pasted here for convenience:
> a blessed named MD node root that would contain string+node pairs
> (alternating elements probably, for space conservation)
>
> a means to add an operand to an MD node with an "indirect"
> reference - this would cause the string to be used as the value (with
> some flag that it's a 'special' string) but any access of the node
> through the operand list would return the underlying MDNode from the
> list (lookup via a map in LLVMContext, essentially).
>
> ---------------------
> Options that are vetoed:
> 1> Specialize Profile function for DI nodes to use MDNodeSet to support
> lookup. No need for a separate map.
> 2> Extend the class hierarchy to have a class for DI nodes:
>  Value -- MDNode -- DINode
>              -- MDString
>              -- User -- Instruction -- BinaryOperator
>
> I personally prefer option b over a since it is much cleaner.
>
> Thanks,
> Manman
>
> On Jun 20, 2013, at 5:39 PM, David Blaikie wrote:
>
> > On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com> wrote:
> >>
> >> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
> >>
> >> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com> wrote:
> >>
> >>
> >> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
> >>
> >> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com> wrote:
> >>
> >>
> >> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
> >>
> >>
> >> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
> >>
> >> Hi Manman,
> >>
> >> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
> >>
> >>
> >> The intent of this proposal is to speedup compilation of "-flto -g" for
> c++
> >> programs.
> >> This is based on discussions with Adrian, David and Eric.
> >>
> >>
> >> Thanks for bringing this back to the list. The original thread was
> >> getting quite long.
> >>
> >> ---------------------------
> >> Problem:
> >> A single class can be used in multiple source files and the DI (Debug
> Info)
> >> class is included in multiple bc files. The duplication of
> >> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to
> >> large memory requirement.
> >>
> >> As an example, SPEC xalancbmk requires 7GB of memory when compiled with
> >> -flto -g.
> >> With a preliminary implementation of type uniquing, the memory usage
> will be
> >> down to 2GB.
> >>
> >> In order to unique types, we have to break cycles in the MDNodes.
> >>
> >> A simple struct definition
> >> struct Base {
> >> int a;
> >> };
> >> can cause cycles in MDNodes:
> >> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32
> 1,
> >> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [
> >> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0]
> [from ]
> >> !14 = metadata !{metadata !15, metadata !16}
> >> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata !"a",
> i32
> >> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
> [line
> >> 2, size 32, align 32, offset 0] [from int]
> >> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata
> !"Base",
> >> metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1
> false, i32
> >> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32
> 1} ;
> >> [ DW_TAG_subprogram ] [line 1] [Base]
> >>
> >> Cycles: !12 -- !14 -- !15 -- !12
> >>        !12 -- !14 -- !16 -- !12
> >>
> >> These cycles make it hard to unique the same struct used in two bc
> files.
> >>
> >> ---------------------------
> >> How to fix:
> >>
> >> We attach a hash value to types to help type uniquing and we also
> replace
> >> references to types with their hash values.
> >> For the above struct "Base", we now have the following MDNodes:
> >> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1,
> i64
> >> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
> >> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align
> 32,
> >> offset 0] [from ]
> >> !6 = metadata !{metadata !7, metadata !9}
> >> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a",
> i32
> >> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
> [line
> >> 2, size 32, align 32, offset 0] [from int]
> >> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata
> !"Base",
> >> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1
> false, i32
> >> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32
> 1} ;
> >> [ DW_TAG_subprogram ] [line 1] [Base]
> >>
> >> Note that the cycles are gone and !4 has a hash value of 915398439, and
> the
> >> references to !4 are replaced with 915398439.
> >> Thanks Eric for suggesting replacing MD reference with a hash value.
> >>
> >>
> >> In particular I recommended this:
> >>
> >> a) For C++ odr replace it with the "hash" that's just a string
> >> representing the type name.
> >> b) For Internal C++ types and all C types replace it with a string
> >> that's a concatenation of the type name and the name of the compile
> >> unit.
> >>
> >> Yes, that is what we agreed on over email.
> >>
> >>
> >>
> >> There are a few issues:
> >> 1> How to generate the hash for a given type?
> >> With C++'s ODR, it should be enough by using the context and the name
> for
> >> non-internal c++ types.
> >> For internal c++ types and types of other languages, hash will not be
> used.
> >>
> >>
> >> Explain this?
> >>
> >>
> >> For a while, I am going to support both hash and MD reference, once
> >> everything is working with hash,
> >> I will update all debug info testing cases, turn -gtype-uniquing on, and
> >> remove the other path.
> >>
> >> For internal c++ types, initially they will follow the path of using MD
> >> references without a hash.
> >>
> >>
> >> My current implementation is to add a few static member functions in
> MDNode
> >> to profile DI nodes differently.
> >> + /// If the array of Vals is for debug info, profile it specially and
> >> return true.
> >> + /// If the DI node has a hash value, generate the profile using only
> the
> >> hash value and the declaration flag.
> >> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals,
> FoldingSetNodeID
> >> &ID);
> >>
> >> + /// If the MDNode is for debug info, profile it specially and return
> true.
> >> + /// If the DI node has a hash value, generate the profile using only
> the
> >> hash value and the declaration flag.
> >> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID
> &ID);
> >>
> >> + /// Given a hash value and a flag, generate the profile for later
> lookup.
> >> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
> >> FoldingSetNodeID &ID);
> >>
> >> These static functions are called in Metadata.cpp:
> >> void MDNode::Profile(FoldingSetNodeID &ID) const {
> >> +  if (profileDebugInfoNode(this, ID))
> >> +    return;
> >> +
> >>
> >> There are other examples of these in MDNode for handling of specific
> >> metadata.
> >> /// Methods for metadata merging.
> >> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
> >> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
> >> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
> >>
> >> Comments are welcome on whether this violates any layering rule.
> >>
> >>
> >> As I've said many times in email, I don't think this is a good idea
> >> and would prefer either a or b below. a) is a much simpler solution.
> >>
> >> Any reason that why it is not a good idea?
> >>
> >>
> >> Other choices are:
> >> a> Keep a map in DwarfDebug
> >> Keep in mind that the map is used at many stages, and it has to be in
> sync
> >> with MDNodeSet.
> >> b> Generalize MDNode to be aware of hash (David can provide more
> details)
> >> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer,
> ll
> >> reader|writer) to be aware of DINode
> >> We can provide DINode::get(…) to create a DINode. DINode can have its
> own
> >> Profile function.
> >> Other suggestions are welcome.
> >>
> >>
> >> a or b please.
> >>
> >> Option a> will require a DwarfDebug pointer in every stage of the
> compiler,
> >> and passing the map to the DI classes.
> >> A rough estimation is around 100 places.
> >> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm
> linker?
> >> Also the map needs to be in sync with MDNodeSet, maybe using
> ValueHandle can
> >> solve the problem.
> >>
> >>
> >> What about putting the map in LLVMContextImpl?
> >> It already has a few things specifically for debug info:
> >> std::vector<DebugRecVH> ScopeRecords;
> >> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
> >> …
> >>
> >> I remember David mentioned it once and I forgot about the conclusion.
> >>
> >>
> >> I mentioned it only as speculation as to how you were implementing it
> >> already (but you were doing the profile-changing stuff).
> >>
> >> I don't think it should be necessary to have the map (in option (a))
> >> in such a central location as LLVMContext. It should be usable just
> >> from DwarfDebug for generation, and DIBuilder can have its own,
> >> separate map to do similar things during DI building.
> >>
> >>
> >> We also need the map during llvm linking since linking will create new
> >> MDNodes.
> >>
> >>
> >> I don't understand what you mean by this. IR linking shouldn't need to
> >> do anything debug-info-specific, it should just be the normal IR
> >> approach.
> >>
> >> The declaration-v-definition resolution can be done at codegen-time.
> >> We'll have to walk all the lists of retained types anyway to build the
> >> map, so we can do declaration-v-definition (keep definitions over
> >> declarations when we see both) at that point.
> >>
> >>
> >> Yes, you are right that at codegen-time, we can generate the map from
> the
> >> lists
> >> of retained types.
> >>
> >> But dumping the linked ll file requires the map when outputting
> comments of
> >> the MDNode :]
> >
> > Depending on which things we print out, but yes, in some cases
> > (derived types) we do print out the type referenced. I assume the
> > AsmPrinter can build such a map, then. (in fact, with a few clients
> > like this, it might be nice to build a bit of an abstraction around it
> > rather than just using a raw map - something that has a ctor (or I
> > suppose it could be a factory function) that reads in the right
> > metadata, walks the compile units, etc, and builds the mapping)
> >
> >>
> >> Thanks,
> >> Manman
> >>
> >>
> >> So any other opinion on putting it in LLVMContext other than it being
> >> central?
> >>
> >> Thanks,
> >> Manman
> >>
> >>
> >>
> >> Thanks,
> >> Manman
> >>
> >>
> >> More details for option b from David
> >>
> >> < The alternative I have in mind is a more complete version of what
> >> < you're proposing - a full MD feature, not an MD feature that's just
> >> < barely enough to support the needs of debug info. What we could do is
> >> < allow the insertion of these MDHash things you spoke about but take it
> >> < a step further and have MDNode::getOperand walk through the hash &
> >> < give the value (in this way, DebugInfo wouldn't have to change at all
> >> < to handle hashes - if the Metadata APIs are going to be aware of the
> >> < hashes anyway, they might as well provide this convenience
> >> < functionality) the metadata feature would also have to have some
> >> < blessed top-level named metadata that would have a list of hash+MDNode
> >> < to keep those MDNodes alive (so you wouldn't have to stuff all the
> >> < types in the retained types list - metadata would provide the full
> >> < support, not just half of it).
> >>
> >>
> >>
> >> Transition from current DI Metadata:
> >> To have a smooth transition, I will add a flag "-gtype-hashing" for the
> type
> >> uniquing work and turn it on by default when we are ready.
> >>
> >>
> >> I'd prefer just make the change to have the front end emit the "hash"
> >> (it's not really a hash, it's just a string).
> >>
> >> Are you saying no transition period? A single patch to have correct
> handling
> >> of "hash" and to update all existing testing cases?
> >>
> >>
> >> -----------------------------
> >> Patches:
> >> Expect the following patches:
> >> 1> add flag -gtype-hashing
> >> 2> add hash field to DI types
> >> 3> modify DIBuilder to use hash instead of MD reference
> >> 4> related to issue 3
> >>
> >>
> >> These can all be a single patch since it shouldn't be very large if we
> >> go with a) above. If we go with b) then the MDNode work should be done
> >> in isolation first and then the debug info on top of it.
> >>
> >> What is wrong with smaller patches?
> >> My estimation for all the above with a) is about 30K + testing cases.
> >>
> >>
> >> 5> backend change (in DwarfDebug|CompileUnit) to support types shared
> among
> >> compile units
> >> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to
> ref_addr
> >>
> >>
> >> #5 can and should be done before the rest of them.
> >>
> >> I prefer to submit patches according to the flow of the compiler,
> starting
> >> from the frontend, then IR, then backend.
> >> The testing cases will be added for front end, llvm-link and backend.
> >> Any reason why #5 should be done first?
> >>
> >>
> >> All changes should be local to debug info classes except patch #4.
> >>
> >>
> >> What's patch #4?
> >>
> >> Patch #4 above: related to issue 3 (changes corresponding to how to
> solve
> >> issue #3)
> >>
> >> -Manman
> >>
> >>
> >> -eric
> >>
> >>
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>
> >>
> >>
> >> _______________________________________________
> >> llvm-commits mailing list
> >> llvm-commits at cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >>
> >>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130621/afcc526f/attachment.html>