[cfe-dev] Plans for module debugging

David Blaikie dblaikie at gmail.com
Mon Dec 1 11:19:58 PST 2014


On Mon, Dec 1, 2014 at 11:18 AM, David Blaikie <dblaikie at gmail.com> wrote:

>
>
> On Mon, Dec 1, 2014 at 10:57 AM, Adrian Prantl <aprantl at apple.com> wrote:
>
>>
>> On Dec 1, 2014, at 10:50 AM, Ben Langmuir <blangmuir at apple.com> wrote:
>>
>>
>> On Dec 1, 2014, at 10:41 AM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>
>> On Dec 1, 2014, at 10:32 AM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>
>> On Dec 1, 2014, at 10:27 AM, Ben Langmuir <blangmuir at apple.com> wrote:
>>
>>
>> On Nov 25, 2014, at 5:25 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>
>> On Nov 24, 2014, at 4:55 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>
>> On Fri, Nov 21, 2014 at 5:52 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>> Plans for module debugging
>>> ==========================
>>>
>>> I recently had a chat with Eric Christopher and David Blaikie to discuss
>>> ideas for debug info for Clang modules and this is what we came up with.
>>>
>>> Goals
>>> -----
>>>
>>> Clang modules [1] (and their siblings, C++ modules and precompiled
>>> header files) are a method for improving compile time by making the
>>> serialized AST for commonly used header files directly available to the
>>> compiler.
>>>
>>> Currently debug info is totally oblivious to this: when the developer
>>> compiles a file that uses a type from a module, clang simply emits a copy
>>> of the full definition (some exceptions apply for C++) of this type in
>>> DWARF into the debug info section of the resulting object file. That's a
>>> lot of copies.
>>>
>>> The key idea is to emit DWARF for types defined in modules only once,
>>> and then only emit references to these types in all the individual compile
>>> units that import this module. We are going to build on the split DWARF and
>>> type unit facilities provided by DWARF for this. DWARF consumers can follow
>>> the type references into the module debug info section, similarly to how
>>> they resolve types in external type units today. Additionally, the format
>>> will allow consumers that support clang modules natively (such as LLDB) to
>>> directly look up types in the module, without having to go through the
>>> usual translation from AST to DWARF and back to AST.
>>>
>>> The primary benefit from doing all this is performance. This change is
>>> expected to reduce the size of the debug info in object files significantly
>>> by
>>> - emitting only references to the full types and thus
>>> - implicitly uniquing types that are defined in modules.
>>> The smaller object files will result in faster compile times and faster
>>> llvm::Module load times when doing LTO. The type uniquing will also result
>>> in significantly smaller debug info for the finished executables,
>>> especially for C and Objective-C, which do not support ODR-based type
>>> uniquing. This comes at the price of longer initial module build times, as
>>> debug info is emitted alongside the module.
>>>
>>> Design
>>> ------
>>>
>>> Clang modules are designed to be ephemeral build artifacts that live in
>>> a shared module cache. Compiling a source file that imports `MyModule`
>>> results in `MyModule.pcm` being generated in the module cache directory;
>>> this file contains the serialized AST of the declarations found in the
>>> header files that comprise the module.
>>>
>>> We will change the binary clang module format to become a container
>>> (ELF, Mach-O, depending on the platform). Inside the container there will
>>> be multiple sections: one containing the serialized AST, and ones
>>> containing DWARF5 split debug type information for all types defined in the
>>> module that can be encoded in DWARF. By virtue of using type units, each
>>> type is emitted into its own type unit which can be identified via a unique
>>> type signature. DWARF consumers can use the type signatures to look up type
>>> definitions in the module debug info section. For module-aware consumers
>>> (LLDB), we will add an index that maps type signatures directly to an
>>> offset in the AST section.
>>>
>>> For an object file that was built using modules, we need to record the
>>> fact that a module has been imported. To this end, we add a
>>> DW_TAG_compile_unit into a COMDAT .debug_info.dwo section that references
>>> the split DWARF inside the module. Similar to split DWARF objects, the
>>> module will be identified by its filename and a checksum. The imported unit
>>> also contains a couple of extra attributes holding all the information
>>> necessary to recreate the module in case the module cache has been flushed.
>>
>>
>> How does the debugging experience work in this case? When do you trigger
>> the (possibly-lengthy) rebuild of the source in order to recreate the DWARF
>> for the module (is it possible to delay that until the information is
>> needed)?
>>
>>
>> The module debugging scenario is primarily aimed at providing a
>> better/faster edit-compile-debug cycle. In this scenario, the module would
>> most likely still be in the cache. In a case where the binary was built so
>> long ago that the module cache has since been flushed, it is generally more
>> likely that the user also used a DWARF linking step (such as dsymutil on
>> Darwin, and maybe dwz on Linux?) because they did a release/archive build,
>> which would just copy the DWARF out of the module and store it alongside
>> the binary. For this reason I’m not very concerned about the time necessary
>> for rebuilding the module. But this is all very platform-specific, and
>> different platforms may need different defaults.
>>
>>
>> This description is in terms of building a module that has gone missing,
>> but just to be clear: a modules-aware debugger probably also needs to
>> rebuild modules that have gone out of date, such as when one of their
>> headers is modified.
>>
>>
>> In the case where the module is out of date, the debugger should probably
>> fall back to the DWARF types, because it cannot guarantee that the
>> modifications to the header files did not change the types it wants to look
>> up.
>>
>>
>> Sorry, I just realized that this doesn’t make any sense if the DWARF is
>> stored in the module. The behavior should be:
>> 1. If the module is missing, recreate the module.
>> 2. If the module signature does not match the signature in the .o file,
>> either print a large warning that types from that module may be bogus, or
>> categorically refuse to use them.
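The two rules above can be sketched as a small decision function. This is only an illustration in Python: `Action`, `ModuleState`, and `choose_action` are hypothetical names, not part of LLDB or clang, and signatures are treated as opaque integers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    REBUILD_MODULE = "rebuild"          # rule 1: module missing from the cache
    WARN_OR_REFUSE = "warn_or_refuse"   # rule 2: signature mismatch
    USE_MODULE = "use"                  # module present and signatures agree


@dataclass
class ModuleState:
    # Signature found in the cached module, or None if the module is missing.
    cached_signature: Optional[int]


def choose_action(recorded_signature: int, module: ModuleState) -> Action:
    """Pick what a module-aware debugger does for one imported module.

    recorded_signature stands for the value stored in the .o file.
    """
    if module.cached_signature is None:
        return Action.REBUILD_MODULE
    if module.cached_signature != recorded_signature:
        return Action.WARN_OR_REFUSE
    return Action.USE_MODULE
```

After a rebuild the same check would presumably run again, since a freshly rebuilt module can still fail to match the recorded signature.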
>>
>>
>> Maybe this is described elsewhere, but what is the “signature” being used
>> here?  Assuming it depends on the detailed contents of the serialized AST:
>> currently ASTWriter output is nondeterministic and things like the ID#s for
>> identifiers, types, etc. will change every time you build the module; until
>> that gets fixed, we would always hit case (2).
>>
>>
>> I was actually hoping that we could rely on deterministic output from
>> clang. If it is infeasible to make ASTWriter output deterministic, we can fall
>> back to something like the DWARF dwo_id signature here.
>>
>
> I believe it's Richard's plan to make modules output deterministic, but
> it's just not the highest pole for him right now. I'm not sure how
> important it is to you guys - nor how difficult it is to do.
>
> (Richard, do correct me if I'm wrong there... )
>

That said - as long as the ID remains the same, it might not be important
that the module output remain identical.

The module will have the table from type hash (which will be stable - the
hash of a mangled, ODR name, etc) to whatever is necessary to identify the
type (or other entity) in the module - so since that goes in the module, it
can vary per module build, so long as the entity/type hash remains the same.

(& the module identifier (even if the contents change) remains the same -
which I would expect it has to so that module builds don't get confused by
out of date or too-new modules, etc)
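
A small sketch of that argument, in Python. Everything concrete here is made up for illustration: the hash function (`blake2b` truncated to 64 bits), the mangled names, and the offsets. The point is only that the object file records the stable hash, while the hash-to-offset table lives in the module and may differ between builds.

```python
import hashlib
from typing import Dict


def type_hash(mangled_name: str) -> int:
    """Stable 64-bit hash of a type's mangled name (hash choice is an assumption)."""
    digest = hashlib.blake2b(mangled_name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def build_table(layout: Dict[str, int]) -> Dict[int, int]:
    """Map type hash -> offset inside one particular build of the module."""
    return {type_hash(name): offset for name, offset in layout.items()}


# Two builds of the same module that happened to lay out the AST differently.
build_a = build_table({"_ZTS8MyStruct": 0x40, "_ZTS5Other": 0x90})
build_b = build_table({"_ZTS8MyStruct": 0xC8, "_ZTS5Other": 0x18})

# What the object file records: only the stable hash, never an offset.
reference_in_object_file = type_hash("_ZTS8MyStruct")
```

The same reference resolves against either build, to whichever offset that build's table holds.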

- David


>
>
> - David
>
>
>>
>> -- adrian
>>
>>
>>
>> For long-term debugging users are expected to use a DWARF linker
>> (dsymutil, dwz), which archives all types in a future-proof format (DWARF).
>>
>> -- adrian
>>
>>
>>
>> Delaying the module DWARF output until needed (maybe even by the
>> debugger!) is an interesting idea. We should definitely measure how
>> expensive it is to emit DWARF for an entire module with lots of types to see if
>> this is worthwhile.
>>
>> How much knowledge does the debugger have/need of Clang's modules to do
>> this? Are we just embedding an arbitrary command that can be run to rebuild
>> the .dwo if it's missing? And if so, how do we make that safe when (say)
>> root attaches a debugger to an arbitrary process?
>>
>>
>> I think it is reasonable to assume that a consumer that can make use of
>> clang modules also knows how to rebuild clang modules, which is why the
>> example only contained the name of the module, sysroot, include path, and
>> defines; not an arbitrary command. On platforms where the debugger does not
>> understand clang modules, the whole problem can be dodged by treating the
>> modules as explicit build artifacts.
>>
>>
>> You are probably already aware, but you will need a bunch more
>> information (language options, target options, header search options) to
>> rebuild a module.
>>
>>
>> Thanks, language options and target options were absent from the list
>> previously!
>>
>> -- adrian
>>
>>
>>
>>
>> Platforms that treat modules as an explicit build artifact do not have
>>> this problem. In the .debug_info section all types that are defined in the
>>> module are referenced via their unique type signature using
>>> DW_FORM_ref_sig8, just as they would be if this were types from a regular
>>> DWARF type unit.
>>>
>>> Example
>>> -------
>>>
>>> Let's say we have a module `MyModule` that defines a type `MyStruct`::
>>>  $ cat foo.c
>>>  #include <MyModule.h>
>>>  MyStruct x;
>>>
>>> when compiling `foo.c` like this::
>>>  clang -fmodules -gmodules foo.c -c
>>>
>>> clang produces `foo.o` and an ELF or Mach-O container for the module::
>>>  /path/to/module-cache/MyModule.pcm
>>>
>>> In the module container, we have a section for the serialized AST and
>>> split DWARF sections for the debug type info. The exact format is likely
>>> still going to evolve a little, but this should give a rough idea::
>>>
>>>  MyModule.pcm:
>>>   .debug_info.dwo:
>>>     DW_TAG_compile_unit
>>>       DW_AT_dwo_name ("/path/to/MyModule.pcm")
>>>       DW_AT_dwo_id   ([unique AST signature])
>>>
>>>     DW_TAG_type_unit ([hash for MyStruct])
>>>        DW_TAG_structure_type
>>>           DW_AT_signature ([hash for MyStruct])
>>>           DW_AT_name "MyStruct"
>>>           ...
>>>
>>>   .debug_abbrev.dwo:
>>>     // abbrevs referenced by .debug_info.dwo
>>>   .debug_line.dwo:
>>>     // filenames referenced by .debug_info.dwo
>>>   .debug_str.dwo:
>>>     // strings referenced by .debug_info.dwo
>>>
>>>   .ast
>>>     // Index at the top of the AST section sorted by hash value.
>>>     [hash for MyStruct] -> [offset for MyStruct in this section]
>>>     ...
>>>     // Serialized AST follows
>>>     ...
>>>
>>> The debug info in foo.o will look like this::
>>>
>>>  .debug_info.dwo
>>>    DW_TAG_compile_unit
>>>       // For DWARF consumers
>>>       DW_AT_dwo_name ("/path/to/module-cache/MyModule.pcm")
>>>       DW_AT_dwo_id   ([unique AST signature])
>>>
>>>       // For LLDB / dsymutil so they can recreate the module
>>>       DW_AT_name "MyModule"
>>>       DW_AT_LLVM_system_root "/"
>>>       DW_AT_LLVM_preprocessor_defines  "-DNDEBUG"
>>>       DW_AT_LLVM_include_path "/path/to/MyModule.map"
>>>
>>>  .debug_info
>>>    DW_TAG_compile_unit
>>>      DW_TAG_variable
>>>        DW_AT_name "x"
>>>        DW_AT_type (DW_FORM_ref_sig8) ([hash for MyStruct])
>>>
>>>
>>> Type signatures
>>> ---------------
>>>
>>> We are going to deviate from the DWARF spec by using a more efficient
>>> hashing function that uses the type's unique mangled name and the name of
>>> the module as input.
>>
>>
>> Why do you need/want the name of the module here? Modules are not a
>> namespacing mechanism. How would you compute this name when the same type
>> is defined in multiple imported modules?
>>
>>
>> Great point! I’m mostly concerned about non-ODR languages ...
>>
>>
>> For languages that do not have mangled type names or an ODR,
>>
>>
>> The people working on C modules have expressed an intent to apply the ODR
>> there too, so it's not clear that Clang modules will support any such
>> language in the longer term.
>>
>>
>> ... and this may be the answer to the question!
>>
>> +Doug: do Objective-C modules have an ODR?
>>
>>
>> we will use the unique identifiers produced by the clang indexer (USRs)
>>> as input instead.
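A minimal sketch of how such a signature might be derived, in Python. The hash function (`blake2b` truncated to 64 bits) is an assumption, as is the function name `type_signature`; the proposal does not name a hash. The module name is left out of the input here because the replies question whether it belongs there. The example strings are illustrative only.

```python
import hashlib
from typing import Optional


def type_signature(mangled_name: Optional[str], usr: Optional[str]) -> int:
    """Return a 64-bit signature, the width that DW_FORM_ref_sig8 carries.

    Prefer the mangled name; fall back to the indexer's USR for languages
    that have no mangled type names.
    """
    key = mangled_name if mangled_name is not None else usr
    if key is None:
        raise ValueError("type has neither a mangled name nor a USR")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Illustrative inputs: a C++-style mangled name and a C-style USR.
sig_cxx = type_signature("_ZTS8MyStruct", None)
sig_c = type_signature(None, "c:@S@MyStruct")
```

When both inputs are available the mangled name wins, so the USR only matters for types that have no mangled name.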
>>>
>>> Extension: Replacing type units with a more efficient storage format
>>> --------------------------------------------------------------------
>>>
>>> As an extension to this proposal, we are thinking of replacing the type
>>> units within the module debug info with a more efficient format: Instead of
>>> emitting each type into its own type unit (complete with its entire
>>> declcontext), it would be much more efficient to emit one large bag of
>>> DWARF together with an index that maps hash values (type signatures) to DIE
>>> offsets.
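The index described here could look roughly like the following Python sketch. Keeping the index sorted by signature and using binary search is an assumption made for the example, and the signatures and DIE offsets are invented values; `build_index` and `lookup` are hypothetical names.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Index = Tuple[List[int], List[int]]


def build_index(entries: List[Tuple[int, int]]) -> Index:
    """Sort (signature, DIE offset) pairs by signature into two parallel lists."""
    ordered = sorted(entries)
    return [sig for sig, _ in ordered], [off for _, off in ordered]


def lookup(index: Index, signature: int) -> Optional[int]:
    """Binary-search the index; return the DIE offset, or None if absent."""
    signatures, offsets = index
    pos = bisect_left(signatures, signature)
    if pos < len(signatures) and signatures[pos] == signature:
        return offsets[pos]
    return None


# Invented (signature, DIE offset) pairs for three types in one module.
index = build_index([(0xBEEF, 0x120), (0x0A11, 0x2C), (0xF00D, 0x300)])
```

A consumer would then resolve a type reference with one `lookup` call and read the DIE at the returned offset, instead of scanning per-type units.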
>>>
>>> Next steps
>>> ----------
>>>
>>> In order to implement this, the next steps would be as follows:
>>> 1. Change the clang module format to be an ELF/Mach-O container.
>>> 2. Teach clang to emit debug info for module types (e.g., by passing an
>>> empty compile unit with retained types to LLVM) into the module container.
>>> 3a. Add a -gmodules switch to clang that triggers the emission of type
>>> signatures for types coming from a module.
>>>
>>
>> Can you clarify what this flag would do? Does this turn on adding DWARF
>> to the .pcm file? Does it turn off generating DWARF for imported modules in
>> the current IR module? Both?
>>
>>
>> It would emit references to the type from imported modules instead of the
>> types themselves.
>> Since the module cache is shared, we could, depending on just how
>> expensive this is, turn on DWARF generation for .pcm files by default. I’d
>> like to measure this first, though.
>>
>>
>>
>> I assume this means that the default remains that we build debug
>> information for modules as if we didn't have modules (that is, put complete
>> DWARF with the object code). Do you think that's the right long-term
>> default? I think it's possibly not.
>>
>>
>> I think you’re absolutely right about the long term. In the short term,
>> it may be better to have compatibility by default, but I don’t know what
>> the official LLVM policy on new features is, if there is one.
>>
>>
>>
>> How does this interact with explicit module builds? Can I use a module
>> built without -g in a compile that uses -g? And if I do, do I get complete
>> debug information, or debug info just for the parts that aren't in the
>> module? Does -gmodules let me choose between these?
>>
>>
>> Personally I would expect old-style (full copy of the types) debug
>> information if I build against a module that does not have embedded debug
>> information.
>>
>> thanks,
>> adrian
>>
>>
>> 3b. Implement type-signature-based lookup in llvm-dsymutil and lldb.
>>> 4a. Emit an index that maps type signatures to AST section offsets into
>>> the module container.
>>> 4b. Implement direct loading of AST types in lldb.
>>> 5a. Improve the efficiency by replacing type units in the module debug
>>> info with a lookup table that maps type signatures to DIE offsets.
>>> 5b. Support this format in lldb and llvm-dsymutil.
>>>
>>> Let me know what you think!
>>>
>>> cheers,
>>> Adrian
>>>
>>> [1] For more details about clang modules see
>>> http://clang.llvm.org/docs/Modules.html and
>>> http://clang.llvm.org/docs/PCHInternals.html
>>>
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev