[LLVMdev] Proposal: type uniquing of debug info for LTO

Thu Jun 20 14:58:07 PDT 2013

Hi Manman,

On Thu, Jun 20, 2013 at 2:51 PM, Manman Ren <mren at apple.com> wrote:
>
> The intent of this proposal is to speedup compilation of "-flto -g" for c++ programs.
> This is based on discussions with Adrian, David and Eric.
>

Thanks for bringing this back to the list. The original thread was
getting quite long.

> ---------------------------
> Problem:
> A single class can be used in multiple source files and the DI (Debug Info) class is included in multiple bc files. The duplication of
> class definitions causes blow-up in # of MDNodes, # of DIEs, leading to large memory requirement.
>
> As an example, SPEC xalancbmk requires 7GB of memory when compiled with -flto -g.
> With a preliminary implementation of type uniquing, the memory usage will be down to 2GB.
>
> In order to unique types, we have to break cycles in the MDNodes.
>
> A simple struct definition
> struct Base {
>  int a;
> };
> can cause cycles in MDNodes:
> !12 = metadata !{i32 786451, metadata !13, null, metadata !"Base", i32 1, i64 32, i64 32, i32 0, i32 0, null, metadata !14, i32 0, null, null} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
> !14 = metadata !{metadata !15, metadata !16}
> !15 = metadata !{i32 786445, metadata !13, metadata !12, metadata !"a", i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line 2, size 32, align 32, offset 0] [from int]
> !16 = metadata !{i32 786478, metadata !13, metadata !12, metadata !"Base", metadata !"Base", metadata !"", i32 1, metadata !17, i1 false, i1 false, i32 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !20, i32 1} ; [ DW_TAG_subprogram ] [line 1] [Base]
>
> Cycles: !12 -- !14 -- !15 -- !12
>             !12 -- !14 -- !16 -- !12
>
> These cycles make it hard to unique the same struct used in two bc files.
>
> ---------------------------
> How to fix:
>
> We attach a hash value to types to help type uniquing and we also replace references to types with their hash values.
> For the above struct "Base", we now have the following MDNodes:
> !4 = metadata !{i32 786451, metadata !5, null, metadata !"Base", i32 1, i64 32, i64 32, i32 0, i32 0, null, metadata !6, i32 0, i32 0, null, i32 915398439} ; [ DW_TAG_structure_type ] [Base] [line 1, size 32, align 32, offset 0] [from ]
> !6 = metadata !{metadata !7, metadata !9}
> !7 = metadata !{i32 786445, metadata !5, i32 915398439, metadata !"a", i32 2, i64 32, i64 32, i64 0, i32 0, metadata !8} ; [ DW_TAG_member ] [a] [line 2, size 32, align 32, offset 0] [from int]
> !9 = metadata !{i32 786478, metadata !5, i32 915398439, metadata !"Base", metadata !"Base", metadata !"", i32 1, metadata !10, i1 false, i1 false, i32 0, i32 0, null, i32 320, i1 false, null, null, i32 0, metadata !13, i32 1} ; [ DW_TAG_subprogram ] [line 1] [Base]
>
> Note that the cycles are gone and !4 has a hash value of 915398439, and the references to !4 are replaced with 915398439.
> Thanks Eric for suggesting replacing MD reference with a hash value.

In particular I recommended this:

a) For C++ odr replace it with the "hash" that's just a string
representing the type name.
b) For Internal C++ types and all C types replace it with a string
that's a concatenation of the type name and the name of the compile
unit.

>
> There are a few issues:
> 1> How to generate the hash for a given type?
> With C++'s ODR, it should be enough by using the context and the name for non-internal c++ types.
> For internal c++ types and types of other languages, hash will not be used.

Explain this?

> My current implementation is to add a few static member functions in MDNode to profile DI nodes differently.
> + /// If the array of Vals is for debug info, profile it specially and return true.
> + /// If the DI node has a hash value, generate the profile using only the hash value and the declaration flag.
> +  static bool profileDebugInfoNode(ArrayRef<Value*> Vals, FoldingSetNodeID &ID);
>
> + /// If the MDNode is for debug info, profile it specially and return true.
> + /// If the DI node has a hash value, generate the profile using only the hash value and the declaration flag.
> +  static bool profileDebugInfoNode(const MDNode *M, FoldingSetNodeID &ID);
>
> + /// Given a hash value and a flag, generate the profile for later lookup.
> +  static bool profileDebugInfoNode(unsigned Hash, bool Declaration, FoldingSetNodeID &ID);
>
> These static functions are called in Metadata.cpp:
> void MDNode::Profile(FoldingSetNodeID &ID) const {
> +  if (profileDebugInfoNode(this, ID))
> +    return;
> +
>
> There are other examples of these in MDNode for handling of specific metadata.
> /// Methods for metadata merging.
> static MDNode *getMostGenericTBAA(MDNode *A, MDNode *B);
> static MDNode *getMostGenericFPMath(MDNode *A, MDNode *B);
> static MDNode *getMostGenericRange(MDNode *A, MDNode *B);
>
> Comments are welcome on whether this violates any layering rule.
>

As I've said many times in email, I don't think this is a good idea
and would prefer either a or b below. a) is a much simpler solution.

> Other choices are:
> a> Keep a map in DwarfDebug
>     Keep in mind that the map is used at many stages, and it has to be in sync with MDNodeSet.
> b> Generalize MDNode to be aware of hash (David can provide more details)
> c> Extend MDNode to DINode and modify streamers (bitcode reader|writer, ll reader|writer) to be aware of DINode
>     We can provide DINode::get(…) to create a DINode. DINode can have its own Profile function.
> Other suggestions are welcome.
>

a or b please.

> Transition from current DI Metadata:
> To have a smooth transition, I will add a flag "-gtype-hashing" for the type uniquing work and turn it on by default when we are ready.
>

I'd prefer just make the change to have the front end emit the "hash"
(it's not really a hash, it's just a string).

> -----------------------------
> Patches:
> Expect the following patches:
> 1> add flag -gtype-hashing
> 2> add hash field to DI types
> 3> modify DIBuilder to use hash instead of MD reference
> 4> related to issue 3

These can all be a single patch since it shouldn't be very large if we
go with a) above. If we go with b) then the MDNode work should be done
in isolation first and then the debug info on top of it.

> 5> backend change (in DwarfDebug|CompileUnit) to support types shared among compile units
>     requires gdwarf-2 gdwarf-3 gdwarf-4 support for issues related to ref_addr

#5 can and should be done before the rest of them.

> All changes should be local to debug info classes except patch #4.

What's patch #4?

-eric