[cfe-dev] Plans for module debugging

Mon Nov 24 09:22:03 PST 2014

On Fri, Nov 21, 2014 at 5:52 PM, Adrian Prantl <aprantl at apple.com> wrote:

> Plans for module debugging
> ==========================
>
> I recently had a chat with Eric Christopher and David Blaikie to discuss
> ideas for debug info for Clang modules and this is what we came up with.
>
> Goals
> -----
>
> Clang modules [1] (and their siblings, C++ modules and precompiled header
> files) are a method for improving compile time by making the serialized AST
> for commonly used header files directly available to the compiler.
>
> Currently debug info is totally oblivious to this: when the developer
> compiles a file that uses a type from a module, clang simply emits a copy
> of the full definition (some exceptions apply for C++) of this type in
> DWARF into the debug info section of the resulting object file. That's a
> lot of copies.
>
> The key idea is to emit DWARF for types defined in modules only once, and
> then only emit references to these types in all the individual compile
> units that import this module. We are going to build on the split DWARF and
> type unit facilities provided by DWARF for this. DWARF consumers can follow
> the type references into the module debug info section, much as
> they resolve types in external type units today. Additionally, the format
> will allow consumers that support clang modules natively (such as LLDB) to
> directly look up types in the module, without having to go through the
> usual translation from AST to DWARF and back to AST.
>
> The primary benefit from doing all this is performance. This change is
> expected to reduce the size of the debug info in object files significantly
> by
> - emitting only references to the full types and thus
> - implicitly uniquing types that are defined in modules.
> The smaller object files will result in faster compile times and faster
> llvm::Module load times when doing LTO. The type uniquing will also result
> in significantly smaller debug info for the finished executables,
> especially for C and Objective-C, which do not support ODR-based type
> uniquing. This comes at the price of longer initial module build times, as
> debug info is emitted alongside the module.
>
> Design
> ------
>
> Clang modules are designed to be ephemeral build artifacts that live in a
> shared module cache. Compiling a source file that imports `MyModule`
> causes `MyModule.pcm` to be generated in the module cache directory. This
> file contains the serialized AST of the declarations found in the header
> files that comprise the module.
>
> We will change the binary clang module format to become a container (ELF
> or Mach-O, depending on the platform). Inside the container there will be
> multiple sections: one containing the serialized AST, and ones containing
> DWARF5 split debug type information for all types defined in the module
> that can be encoded in DWARF. By virtue of using type units, each type is
> emitted into its own type unit which can be identified via a unique type
> signature. DWARF consumers can use the type signatures to look up type
> definitions in the module debug info section. For module-aware consumers
> (LLDB), we will add an index that maps type signatures directly to an
> offset in the AST section.
>
> For an object file that was built using modules, we need to record the
> fact that a module has been imported. To this end, we add a
> DW_TAG_compile_unit into a COMDAT .debug_info.dwo section that references
> the split DWARF inside the module. Similar to split DWARF objects, the
> module will be identified by its filename and a checksum. The imported unit
> also contains a couple of extra attributes holding all the information
> necessary to recreate the module in case the module cache has been flushed.
> Platforms that treat modules as an explicit build artifact do not have this
> problem. In the .debug_info section all types that are defined in the
> module are referenced via their unique type signature using
> DW_FORM_ref_sig8, just as they would be if these were types from a regular
> DWARF type unit.
>
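A minimal sketch of how a consumer might follow such a module reference, written in Python for brevity. Everything here is an assumption for illustration: the class names, the `resolve_type` helper, and the dictionary standing in for the module cache are hypothetical and do not correspond to any clang, LLDB, or dsymutil API. It only shows the order of checks described above: find the module by file name, compare the checksum, then look up the type by signature.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ModuleReference:
    """Hypothetical view of the imported compile unit in the object file."""
    dwo_name: str  # value of DW_AT_dwo_name
    dwo_id: int    # value of DW_AT_dwo_id (checksum / AST signature)


@dataclass
class ModuleContainer:
    """Hypothetical view of a module container found in the cache."""
    dwo_id: int
    type_units: Dict[int, str]  # type signature -> description of the type


def resolve_type(ref: ModuleReference, signature: int,
                 cache: Dict[str, ModuleContainer]) -> Optional[str]:
    """Return the type for `signature`, or None if it cannot be resolved."""
    module = cache.get(ref.dwo_name)
    # A missing or stale module (checksum mismatch) cannot be trusted.
    # A real consumer would try to recreate it from the extra attributes.
    if module is None or module.dwo_id != ref.dwo_id:
        return None
    return module.type_units.get(signature)


cache = {"/path/to/module-cache/MyModule.pcm":
         ModuleContainer(dwo_id=0x1234, type_units={0xABCD: "struct MyStruct"})}
ref = ModuleReference("/path/to/module-cache/MyModule.pcm", 0x1234)
print(resolve_type(ref, 0xABCD, cache))
```

Returning `None` for a stale module keeps the checksum comparison in one place; what to do next (rebuild, or fall back to another source of type info) is left to the caller.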
> Example
> -------
>
> Let's say we have a module `MyModule` that defines a type `MyStruct`::
>  $ cat foo.c
>  #include <MyModule.h>
>  MyStruct x;
>
> when compiling `foo.c` like this::
>  clang -fmodules -gmodules foo.c -c
>
> clang produces `foo.o` and an ELF or Mach-O container for the module::
>  /path/to/module-cache/MyModule.pcm
>
> In the module container, we have a section for the serialized AST and
> split DWARF sections for the debug type info. The exact format is likely
> still going to evolve a little, but this should give a rough idea::
>
>  MyModule.pcm:
>   .debug_info.dwo:
>     DW_TAG_compile_unit
>       DW_AT_dwo_name ("/path/to/MyModule.pcm")
>       DW_AT_dwo_id   ([unique AST signature])
>
>     DW_TAG_type_unit ([hash for MyStruct])
>        DW_TAG_structure_type
>           DW_AT_signature ([hash for MyStruct])
>           DW_AT_name "MyStruct"
>           ...
>
>   .debug_abbrev.dwo:
>     // abbrevs referenced by .debug_info.dwo
>   .debug_line.dwo:
>     // filenames referenced by .debug_info.dwo
>   .debug_str.dwo:
>     // strings referenced by .debug_info.dwo
>
>   .ast
>     // Index at the top of the AST section sorted by hash value.
>     [hash for MyStruct] -> [offset for MyStruct in this section]
>     ...
>     // Serialized AST follows
>     ...
>
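Because the index at the top of the `.ast` section is sorted by hash value, a lookup can be a binary search. A rough sketch in Python, assuming the index has already been read into (signature, offset) pairs; the `SignatureIndex` class and its `find` method are hypothetical names, and the on-disk encoding is not specified by the proposal.

```python
import bisect


class SignatureIndex:
    """Hypothetical in-memory form of the sorted hash -> offset index."""

    def __init__(self, entries):
        # Sort defensively; the proposal says the index is stored sorted.
        ordered = sorted(entries)
        self._signatures = [sig for sig, _ in ordered]
        self._offsets = [off for _, off in ordered]

    def find(self, signature):
        """Return the AST section offset for `signature`, or None."""
        i = bisect.bisect_left(self._signatures, signature)
        if i < len(self._signatures) and self._signatures[i] == signature:
            return self._offsets[i]
        return None


# Made-up signatures and offsets, for illustration only.
index = SignatureIndex([(0x30, 400), (0x10, 128), (0x20, 256)])
print(index.find(0x20))
```

Keeping signatures and offsets in two parallel lists lets `bisect` compare plain integers, which mirrors how a fixed-width on-disk table would be searched.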
> The debug info in foo.o will look like this::
>
>  .debug_info.dwo
>

(so if this goes in debug_info.dwo then it would be in foo.dwo, not
foo.o... but I had some further thoughts about this... )

So - imagining a future in which modules are real object files that get
linked into the final executable because they contain things like
definitions of linkonce_odr functions (so that any object file that has all
the linkonce_odr calls inlined doesn't have to carry around a (probably
duplicate) definition of the function) - then that object file could also
contain the skeleton CU (& associated line table, string table, etc)
for not only these functions, but for all the types, etc, as well.

In that world, we would have exactly fission, nothing new (no two-level
fission, where some static-data-only skeletons appear in the .dwo file and
the skeletons with non-static data (ie: with relocations, such as those
describing concrete function definitions or global variables) appear in the
.o file).

We can reach that same output today by adding these skeletons into the .o
file (in debug_info, not debug_info.dwo) and using comdat to unique them
during linking.

This option would be somewhat wasteful for now (& in the future for any
module that had /no/ concrete code that could be kept in the module - such
as would be the case in pure template libraries with no explicit
instantiation decl/defs, etc) because it would put module references in the
.o, but it would mean not having to teach tools new fission tricks
immediately.

Then, if we wanted to add an optimization of double-indirection fission
(having skeleton CUs in .dwo files that reference further .dwo files) we
could do that as a separate step on top.

It's just a thought - Maybe it's an unnecessary extra step and we should
just go for the double-indirection from the get-go, I'm not sure?

Opinions?

>    DW_TAG_compile_unit
>       // For DWARF consumers
>       DW_AT_dwo_name ("/path/to/module-cache/MyModule.pcm")
>       DW_AT_dwo_id   ([unique AST signature])
>
>       // For LLDB / dsymutil so they can recreate the module
>       DW_AT_name "MyModule"
>       DW_AT_LLVM_system_root "/"
>       DW_AT_LLVM_preprocessor_defines  "-DNDEBUG"
>       DW_AT_LLVM_include_path "/path/to/MyModule.map"
>
>  .debug_info
>    DW_TAG_compile_unit
>      DW_TAG_variable
>        DW_AT_name "x"
>        DW_AT_type (DW_FORM_ref_sig8) ([hash for MyStruct])
>
>
> Type signatures
> ---------------
>
> We are going to deviate from the DWARF spec by using a more efficient
> hashing function that uses the type's unique mangled name and the name of
> the module as input. For languages that do not have mangled type names or
> an ODR, we will use the unique identifiers produced by the clang indexer
> (USRs) as input instead.
>
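A sketch of what such a signature function could look like, in Python. The proposal does not name the hash, so truncated BLAKE2b is used here purely as a stand-in, and `type_signature` is a hypothetical helper. The only properties taken from the text are the inputs (module name plus a unique type name, either a mangled name or a USR) and the 8-byte result implied by DW_FORM_ref_sig8. The USR-like string in the usage line is illustrative.

```python
import hashlib


def type_signature(module_name: str, unique_type_name: str) -> int:
    """Return a 64-bit signature from the module name and a unique type name."""
    # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = (module_name.encode("utf-8") + b"\0" +
               unique_type_name.encode("utf-8"))
    # Stand-in hash: BLAKE2b truncated to 8 bytes. The real choice of
    # hash function is not specified in the proposal.
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "little")


sig = type_signature("MyModule", "c:@S@MyStruct")
print(f"0x{sig:016x}")
```

Including the module name in the input means that two modules defining a type with the same name get different signatures, which matters for C and Objective-C where there is no ODR to rely on.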
> Extension: Replacing type units with a more efficient storage format
> --------------------------------------------------------------------
>
> As an extension to this proposal, we are thinking of replacing the type
> units within the module debug info with a more efficient format: Instead of
> emitting each type into its own type unit (complete with its entire
> declcontext), it would be much more efficient to emit one large bag of
> DWARF together with an index that maps hash values (type signatures) to DIE
> offsets.
>
> Next steps
> ----------
>
> In order to implement this, the next steps would be as follows:
> 1. Change the clang module format to be an ELF/Mach-O container.
> 2. Teach clang to emit debug info for module types (e.g., by passing an
> empty compile unit with retained types to LLVM) into the module container.
> 3a. Add a -gmodules switch to clang that triggers the emission of type
> signatures for types coming from a module.
> 3b. Implement type-signature-based lookup in llvm-dsymutil and lldb.
> 4a. Emit an index that maps type signatures to AST section offsets into
> the module container.
> 4b. Implement direct loading of AST types in lldb.
> 5a. Improve the efficiency by replacing type units in the module debug info
> with a lookup table that maps type signatures to DIE offsets.
> 5b. Support this format in lldb and llvm-dsymutil.
>
> Let me know what you think!
>
> cheers,
> Adrian
>
> [1] For more details about clang modules see
> http://clang.llvm.org/docs/Modules.html and
> http://clang.llvm.org/docs/PCHInternals.html
>
>