[LLVMdev] Proposal: type uniquing of debug info for LTO

Fri Jun 21 11:23:22 PDT 2013

On Fri, Jun 21, 2013 at 10:39 AM, Manman Ren <mren at apple.com> wrote:
>
> On Jun 21, 2013, at 10:18 AM, David Blaikie wrote:
>
> On Fri, Jun 21, 2013 at 10:13 AM, Manman Ren <mren at apple.com> wrote:
>
>
> More details please :]
>
> What do you mean by "name the metadata"? Are you referring to the name field
>
> of the MDNode?
>
>
> Using named metadata rather than unnamed metadata.
>
> rather than having:
>
> !llvm.hardref = !{metadata !"foo.h::myClass", !3, metadata
> !"bar.h:myEnum", !4} ...
> !3 = ...;
> !4 = ...;
>
> we could simply have:
>
> !llvm.hardref.foo.h.myClass = ...;
> !llvm.hardref.bar.h.myClass = ...;
>
> or something like that.
>
>
> From the documentation:
> NamedMDNode - a tuple of MDNodes. Despite its name, a NamedMDNode isn't
> itself an MDNode. NamedMDNodes belong to modules, have names, and contain
> lists of MDNodes.
>
> Our DI classes all wrap around a MDNode. Also during IR linking, NamedMDNode
> is handled differently.
> Did I miss something?

Nope - just that I'm not especially familiar with this stuff, as I've mentioned.

So we could change from:

!llvm.hardref = !{metadata !"foo.h::myClass", !3, metadata
!"bar.h:myEnum", !4} ...
!3 = ...;
!4 = ...;

to

!llvm.hardref.foo.h.myClass = !3;
!llvm.hardref.bar.h.myClass = !4;

But that might be less beneficial, since there's still that indirection, etc.

(in any case this is orthogonal to the major issue under discussion -
it's equally applicable to either (a) or (b))

>
> Thanks,
> Manman
>
>
>
> (assuming solution (b), in solution (a) it'd look more like
> "llvm.dbg.type.foo.h.myClass" - but the same idea (then we could skip
> using the retained types list, avoid adding a new 'name' member to
> DITypes or changing the schema of the retained types list to be
> string/mdnode pairs))
>
>
> Thanks,
>
> Manman
>
>
> On Jun 21, 2013, at 7:19 AM, David Blaikie wrote:
>
>
>
> On Jun 21, 2013 1:19 AM, "Jeremy Lakeman" <Jeremy.Lakeman at gmail.com> wrote:
>
>
> Wouldn't it be simpler to name the metadata based on the hash of the
>
> content? Then you could use a normal reference to that metadata without
>
> needing to create a new type or teach the rest of llvm how to use it...
>
>
> Sounds reasonable (probably with some nice prefix to get these things a
>
> namespace). Though this applies to both (a) and (b) equally.
>
>
>
>
> On Fri, Jun 21, 2013 at 3:22 PM, Manman Ren <mren at apple.com> wrote:
>
>
>
> A summary of options for issue #3:
>
> 3> To actually access the MDNode referenced via the hash value, we need
>
> to perform a lookup from the hash value to find the corresponding MDNode.
>
> The questions are where to store this map and how to keep it up-to-date
>
> when a MDNode is replaced.
>
> ---------------------
>
> Option a) a map in DwarfDebug, AsmWriter and DIBuilder, modify access
>
> functions in DI classes to pass in the map.
>
> May need to modify printing functions in AsmWriter.
>
> ---------------------
>
> Option b) I am going to expand David's description with more details,
>
> correct me if I am wrong.
>
>   A special Value called MDHash in parallel with MDString, MDHash needs
>
> to specify whether it is a use of the hash value, or a definition of the
>
> hash value
>
>   The updated class hierarchy:
>
>   Value -- MDNode
>
>               -- MDHash
>
>               -- MDString
>
>               -- User -- Instruction -- BinaryOperator
>
>
>   In the following example:
>
> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1,
>
> i64 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>
> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32,
>
> offset 0] [from ]
>
> !6 = metadata !{metadata !7, metadata !9}
>
> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a",
>
> i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>
> [line 2, size 32, align 32, offset 0] [from int]
>
> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base",
>
> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32
>
> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ;
>
> [ DW_TAG_subprogram ] [line 1] [Base]
>
>   915398439 in !4 is the definition of a hash value, 915398439 in !7 or
>
> !9 is the use of a hash value.
>
>
> A separate map "HashToNodeMap" is added to LLVMContext.
>
> MDNode will handle MDHash correctly in the sense that it will update
>
> HashToNodeMap whenever it creates a MDNode with a MDHash operand.
>
> MDNode::getOperand will perform a lookup if the actual operand is MDHash
>
> and return the corresponding MDNode.
>
>
> I am not sure whether a named metadata to describe the map is necessary,
>
> assuming we already have the retained type list for each CU.
>
> Also at IR linking, the named metadata from all CUs will be combined,
>
> leading to a long list.
>
>
> ---------------------
>
> David's description is pasted here for convenience:
>
> a blessed named MD node root that would contain string+node pairs
>
> (alternating elements probably, for space conservation)
>
>
> a means to add an operand to an MD node with an "indirect"
>
> reference - this would cause the string to be used as the value (with
>
> some flag that it's a 'special' string) but any access of the node
>
> through the operand list would return the underlying MDNode from the
>
> list (lookup via a map in LLVMContext, essentially).
>
>
> ---------------------
>
> Options that are vetoed:
>
> 1> Specialize Profile function for DI nodes to use MDNodeSet to support
>
> lookup. No need for a separate map.
>
> 2> Extend the class hierarchy to have a class for DI nodes:
>
> Value -- MDNode -- DINode
>
>             -- MDString
>
>             -- User -- Instruction -- BinaryOperator
>
>
> I personally prefer option b over a since it is much cleaner.
>
>
> Thanks,
>
> Manman
>
>
> On Jun 20, 2013, at 5:39 PM, David Blaikie wrote:
>
>
> On Thu, Jun 20, 2013 at 5:25 PM, Manman Ren <mren at apple.com> wrote:
>
>
> On Jun 20, 2013, at 5:18 PM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
> On Thu, Jun 20, 2013 at 5:13 PM, Manman Ren <mren at apple.com> wrote:
>
>
>
> On Jun 20, 2013, at 4:52 PM, David Blaikie wrote:
>
>
> On Thu, Jun 20, 2013 at 4:45 PM, Manman Ren <mren at apple.com> wrote:
>
>
>
> On Jun 20, 2013, at 3:55 PM, Manman Ren <mren at apple.com> wrote:
>
>
>
> On Jun 20, 2013, at 2:58 PM, Eric Christopher wrote:
>
>
> Hi Manman,
>
>
> On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>
>
>
> The intent of this proposal is to speedup compilation of "-flto -g"
>
> for c++
>
> programs.
>
> This is based on discussions with Adrian, David and Eric.
>
>
>
> Thanks for bringing this back to the list. The original thread was
>
> getting quite long.
>
>
> ---------------------------
>
> Problem:
>
> A single class can be used in multiple source files and the DI (Debug
>
> Info)
>
> class is included in multiple bc files. The duplication of
>
> class definitions causes blow-up in # of MDNodes, # of DIEs, leading
>
> to
>
> large memory requirement.
>
>
> As an example, SPEC xalancbmk requires 7GB of memory when compiled
>
> with
>
> -flto -g.
>
> With a preliminary implementation of type uniquing, the memory usage
>
> will be
>
> down to 2GB.
>
>
> In order to unique types, we have to break cycles in the MDNodes.
>
>
> A simple struct definition
>
> struct Base {
>
> int a;
>
> };
>
> can cause cycles in MDNodes:
>
> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32
>
> 1,
>
> i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ;
>
> [
>
> DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0]
>
> [from ]
>
> !14 = metadata !{metadata !15, metadata !16}
>
> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata
>
> !"a", i32
>
> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>
> [line
>
> 2, size 32, align 32, offset 0] [from int]
>
> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata
>
> !"Base",
>
> metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1
>
> false, i32
>
> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20,
>
> i32 1} ;
>
> [ DW_TAG_subprogram ] [line 1] [Base]
>
>
> Cycles: !12 -- !14 -- !15 -- !12
>
>       !12 -- !14 -- !16 -- !12
>
>
> These cycles make it hard to unique the same struct used in two bc
>
> files.
>
>
> ---------------------------
>
> How to fix:
>
>
> We attach a hash value to types to help type uniquing and we also
>
> replace
>
> references to types with their hash values.
>
> For the above struct "Base", we now have the following MDNodes:
>
> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32
>
> 1, i64
>
> 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32
>
> 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align
>
> 32,
>
> offset 0] [from ]
>
> !6 = metadata !{metadata !7, metadata !9}
>
> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a",
>
> i32
>
> 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a]
>
> [line
>
> 2, size 32, align 32, offset 0] [from int]
>
> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata
>
> !"Base",
>
> metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1
>
> false, i32
>
> 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13,
>
> i32 1} ;
>
> [ DW_TAG_subprogram ] [line 1] [Base]
>
>
> Note that the cycles are gone and !4 has a hash value of 915398439,
>
> and the
>
> references to !4 are replaced with 915398439.
>
> Thanks Eric for suggesting replacing MD reference with a hash value.
>
>
>
> In particular I recommended this:
>
>
> a) For C++ odr replace it with the "hash" that's just a string
>
> representing the type name.
>
> b) For Internal C++ types and all C types replace it with a string
>
> that's a concatenation of the type name and the name of the compile
>
> unit.
>
>
> Yes, that is what we agreed on over email.
>
>
>
>
> There are a few issues:
>
> 1> How to generate the hash for a given type?
>
> With C++'s ODR, it should be enough by using the context and the name
>
> for
>
> non-internal c++ types.
>
> For internal c++ types and types of other languages, hash will not be
>
> used.
>
>
>
> Explain this?
>
>
>
> For a while, I am going to support both hash and MD reference, once
>
> everything is working with hash,
>
> I will update all debug info testing cases, turn -gtype-uniquing on,
>
> and
>
> remove the other path.
>
>
> For internal c++ types, initially they will follow the path of using
>
> MD
>
> references without a hash.
>
>
>
> My current implementation is to add a few static member functions in
>
> MDNode
>
> to profile DI nodes differently.
>
> + /// If the array of Vals is for debug info, profile it specially and
>
> return true.
>
> + /// If the DI node has a hash value, generate the profile using only
>
> the
>
> hash value and the declaration flag.
>
> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals,
>
> FoldingSetNodeID
>
> &ID);
>
>
> + /// If the MDNode is for debug info, profile it specially and return
>
> true.
>
> + /// If the DI node has a hash value, generate the profile using only
>
> the
>
> hash value and the declaration flag.
>
> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID
>
> &ID);
>
>
> + /// Given a hash value and a flag, generate the profile for later
>
> lookup.
>
> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration,
>
> FoldingSetNodeID &ID);
>
>
> These static functions are called in Metadata.cpp:
>
> void MDNode::Profile(FoldingSetNodeID &ID) const {
>
> +  if (profileDebugInfoNode(this, ID))
>
> +    return;
>
> +
>
>
> There are other examples of these in MDNode for handling of specific
>
> metadata.
>
> /// Methods for metadata merging.
>
> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
>
> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
>
> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>
>
> Comments are welcome on whether this violates any layering rule.
>
>
>
> As I've said many times in email, I don't think this is a good idea
>
> and would prefer either a or b below. a) is a much simpler solution.
>
>
> Any reason that why it is not a good idea?
>
>
>
> Other choices are:
>
> a> Keep a map in DwarfDebug
>
> Keep in mind that the map is used at many stages, and it has to be in
>
> sync
>
> with MDNodeSet.
>
> b> Generalize MDNode to be aware of hash (David can provide more
>
> details)
>
> c> Extend MDNode to DINode and modify streamers (bitcode
>
> reader|writer, ll
>
> reader|writer) to be aware of DINode
>
> We can provide DINode::get(…) to create a DINode. DINode can have its
>
> own
>
> Profile function.
>
> Other suggestions are welcome.
>
>
>
> a or b please.
>
>
> Option a> will require a DwarfDebug pointer in every stage of the
>
> compiler,
>
> and passing the map to the DI classes.
>
> A rough estimation is around 100 places.
>
> Is it reasonable to pass a DwarfDebug pointer to DIBuilder and llvm
>
> linker?
>
> Also the map needs to be in sync with MDNodeSet, maybe using
>
> ValueHandle can
>
> solve the problem.
>
>
>
> What about putting the map in LLVMContextImpl?
>
> It already has a few things specifically for debug info:
>
> std::vector<DebugRecVH> ScopeRecords;
>
> DenseMap<std::pair<MDNode*, MDNode*>, int> ScopeInlinedAt;
>
> …
>
>
> I remember David mentioned it once and I forgot about the conclusion.
>
>
>
> I mentioned it only as speculation as to how you were implementing it
>
> already (but you were doing the profile-changing stuff).
>
>
> I don't think it should be necessary to have the map (in option (a))
>
> in such a central location as LLVMContext. It should be usable just
>
> from DwarfDebug for generation, and DIBuilder can have its own,
>
> separate map to do similar things during DI building.
>
>
>
> We also need the map during llvm linking since linking will create new
>
> MDNodes.
>
>
>
> I don't understand what you mean by this. IR linking shouldn't need to
>
> do anything debug-info-specific, it should just be the normal IR
>
> approach.
>
>
> The declaration-v-definition resolution can be done at codegen-time.
>
> We'll have to walk all the lists of retained types anyway to build the
>
> map, so we can do declaration-v-definition (keep definitions over
>
> declarations when we see both) at that point.
>
>
>
> Yes, you are right that at codegen-time, we can generate the map from
>
> the
>
> lists
>
> of retained types.
>
>
> But dumping the linked ll file requires the map when outputting
>
> comments of
>
> the MDNode :]
>
>
> Depending on which things we print out, but yes, in some cases
>
> (derived types) we do print out the type referenced. I assume the
>
> AsmPrinter can build such a map, then. (in fact, with a few clients
>
> like this, it might be nice to build a bit of an abstraction around it
>
> rather than just using a raw map - something that has a ctor (or I
>
> suppose it could be a factory function) that reads in the right
>
> metadata, walks the compile units, etc, and builds the mapping)
>
>
>
> Thanks,
>
> Manman
>
>
>
> So any other opinion on putting it in LLVMContext other than it being
>
> central?
>
>
> Thanks,
>
> Manman
>
>
>
>
> Thanks,
>
> Manman
>
>
>
> More details for option b from David
>
>
> < The alternative I have in mind is a more complete version of what
>
> < you're proposing - a full MD feature, not an MD feature that's just
>
> < barely enough to support the needs of debug info. What we could do
>
> is
>
> < allow the insertion of these MDHash things you spoke about but take
>
> it
>
> < a step further and have MDNode::getOperand walk through the hash &
>
> < give the value (in this way, DebugInfo wouldn't have to change at
>
> all
>
> < to handle hashes - if the Metadata APIs are going to be aware of the
>
> < hashes anyway, they might as well provide this convenience
>
> < functionality) the metadata feature would also have to have some
>
> < blessed top-level named metadata that would have a list of
>
> hash+MDNode
>
> < to keep those MDNodes alive (so you wouldn't have to stuff all the
>
> < types in the retained types list - metadata would provide the full
>
> < support, not just half of it).
>
>
>
>
> Transition from current DI Metadata:
>
> To have a smooth transition, I will add a flag "-gtype-hashing" for
>
> the type
>
> uniquing work and turn it on by default when we are ready.
>
>
>
> I'd prefer just make the change to have the front end emit the "hash"
>
> (it's not really a hash, it's just a string).
>
>
> Are you saying no transition period? A single patch to have correct
>
> handling
>
> of "hash" and to update all existing testing cases?
>
>
>
> -----------------------------
>
> Patches:
>
> Expect the following patches:
>
> 1> add flag -gtype-hashing
>
> 2> add hash field to DI types
>
> 3> modify DIBuilder to use hash instead of MD reference
>
> 4> related to issue 3
>
>
>
> These can all be a single patch since it shouldn't be very large if we
>
> go with a) above. If we go with b) then the MDNode work should be done
>
> in isolation first and then the debug info on top of it.
>
>
> What is wrong with smaller patches?
>
> My estimation for all the above with a) is about 30K + testing cases.
>
>
>
> 5> backend change (in DwarfDebug|CompileUnit) to support types shared
>
> among
>
> compile units
>
> requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to
>
> ref_addr
>
>
>
> #5 can and should be done before the rest of them.
>
>
> I prefer to submit patches according to the flow of the compiler,
>
> starting
>
> from the frontend, then IR, then backend.
>
> The testing cases will be added for front end, llvm-link and backend.
>
> Any reason why #5 should be done first?
>
>
>
> All changes should be local to debug info classes except patch #4.
>
>
>
> What's patch #4?
>
>
> Patch #4 above: related to issue 3 (changes corresponding to how to
>
> solve
>
> issue #3)
>
>
> -Manman
>
>
>
> -eric
>
>
>
>
> _______________________________________________
>
> LLVM Developers mailing list
>
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
> _______________________________________________
>
> llvm-commits mailing list
>
> llvm-commits at cs.uiuc.edu
>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
>
>
>
> _______________________________________________
>
> LLVM Developers mailing list
>
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>
>