[PATCH] Have clang list the imported modules in the debug info

Mon May 4 12:27:31 PDT 2015

> 
>> On May 4, 2015, at 11:38 AM, David Blaikie <dblaikie at gmail.com> wrote:
>> 
>> 
>> 
>> On Mon, May 4, 2015 at 11:24 AM, Adrian Prantl <aprantl at apple.com> wrote:
>> 
>>> On May 4, 2015, at 10:53 AM, David Blaikie <dblaikie at gmail.com> wrote:
>>> 
>>> 
>>> 
>>> On Fri, May 1, 2015 at 8:52 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>> 
>>>>> On May 1, 2015, at 5:25 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, May 1, 2015 at 5:19 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>> 
>>>>>> On May 1, 2015, at 4:55 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 1, 2015 at 4:39 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>>> 
>>>>>> > On May 1, 2015, at 10:01 AM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Fri, May 1, 2015 at 9:52 AM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>>> >>
>>>>>> >>> On May 1, 2015, at 9:23 AM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Thu, Apr 30, 2015 at 5:21 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>>> >>>
>>>>>> >>> > On Apr 30, 2015, at 4:55 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > On Thu, Apr 30, 2015 at 4:31 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>>> >>> >>
>>>>>> >>> >> > On Mar 19, 2015, at 5:37 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> > On Thu, Mar 19, 2015 at 5:24 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>>>> >>> >> >>
>>>>>> >>> >> >> > On Mar 16, 2015, at 2:55 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >> On Mon, Mar 16, 2015 at 2:45 PM, Robinson, Paul <Paul_Robinson at playstation.sony.com> wrote:
>>>>>> >>> >> >> > Beyond the above (that using a new tag would mean this would go from 'free' to 'not free' for GDB), having a new top level tag is pretty substantial (we only have two at the moment, and with our talk of modules being a "bag of dwarf" we might go back to having one top level tag?). It's not clear to me from DWARF4 whether DW_TAG_module is currently a top-level tag; I don't think it is.
>>>>>> >>> >> >> >
>>>>>> >>> >> >> >> The .debug_info section contains one or more compilation units, partial units, or in DWARF 5, type units.  DW_TAG_module isn't a unit, if you want it to be handled independently then it would need to be wrapped in a DW_TAG_partial_unit.  You would probably then use DW_TAG_imported_unit to refer to it, rather than DW_TAG_imported_module.
>>>>>> >>> >> >> >>
>>>>>> >>> >> >> >
>>>>>> >>> >> >> > This makes a fair bit of sense - though the terminology's never going to quite line up with modules, I suspect, and this would still require modifying existing consumers (well, GDB) that can handle split-dwarf today, I suspect (not sure how it'd handle partial_unit - maybe that does work? - and still don't know how existing consumers would handle imported_unit either - could be worth some testing, as it sounds sort of right out of several less right options).
>>>>>> >>> >> >>
>>>>>> >>> >> >> Thanks for all the input so far!
>>>>>> >>> >> >> To make this end of the discussion concrete, let’s sketch some DWARF showing how this could look in practice.
>>>>>> >>> >> >>
>>>>>> >>> >> >> ELF (no imports)
>>>>>> >>> >> >> ----------------
>>>>>> >>> >> >>
>>>>>> >>> >> >> On ELF or COFF a foo.c referencing types from the module Foundation looks like this:
>>>>>> >>> >> >>
>>>>>> >>> >> >> .debug_info:
>>>>>> >>> >> >>   DW_TAG_compile_unit
>>>>>> >>> >> >>     DW_AT_name(“foo.c”)
>>>>>> >>> >> >>
>>>>>> >>> >> >> .debug_info.dwo (on ELF: group 0x1234ABCDE, comdat)
>>>>>> >>> >> >>   DW_TAG_partial_unit
>>>>>> >>> >> >
>>>>>> >>> >> > For now I'd suggest we use compile_unit - that way it'll just work with existing split-dwarf consumers. We can see about standardizing a top-level DW_TAG_module or using DW_TAG_partial_unit here later, perhaps? I'm not sure.
>>>>>> >>> >> >
>>>>>> >>> >> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/Foundation.pcm”)
>>>>>> >>> >> >>     DW_AT_dwo_id(“0x1234ABCDE”)
>>>>>> >>> >> >>
>>>>>> >>> >> >>
>>>>>> >>> >> >> Side question: Is .debug_info.dwo the right section to put the module skeleton in, or should it be a .debug_info section like normal fission skeletons?
>>>>>> >>> >> >
>>>>>> >>> >> > Skeletons go in .debug_info, the dwo sections are just for the .dwo file (or the module file, in our new case - the extension isn't actually important).
>>>>>> >>> >> >
>>>>>> >>> >> > It might be worth you compiling an example or two of split-dwarf to see how this all works hands-on.
>>>>>> >>> >> >
>>>>>> >>> >> >> Mach-O (no comdat, no imports)
>>>>>> >>> >> >> ------------------------------
>>>>>> >>> >> >>
>>>>>> >>> >> >> Mach-O doesn’t do comdat, so with -split-dwarf=Disable (not sure if that option is the best discriminator) this could look like:
>>>>>> >>> >> >>
>>>>>> >>> >> >> .debug_info:
>>>>>> >>> >> >>   DW_TAG_compile_unit
>>>>>> >>> >> >>     DW_AT_name(“foo.c”)
>>>>>> >>> >> >>   DW_TAG_partial_unit
>>>>>> >>> >> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/Foundation.pcm”)
>>>>>> >>> >> >>     DW_AT_dwo_id(“0x1234ABCDE”)
>>>>>> >>> >> >>
>>>>>> >>> >> >>
>>>>>> >>> >> >> Mach-O (no comdat, with imports)
>>>>>> >>> >> >> ------------------------------
>>>>>> >>> >> >>
>>>>>> >>> >> >> If we add the module import information to this, we get:
>>>>>> >>> >> >>
>>>>>> >>> >> >> .debug_info:
>>>>>> >>> >> >>   DW_TAG_compile_unit
>>>>>> >>> >> >>     DW_AT_name(“foo.c”)
>>>>>> >>> >> >>     DW_TAG_imported_module
>>>>>> >>> >> >>       DW_AT_import(DW_FORM_ref_addr 0x10)
>>>>>> >>> >> >
>>>>>> >>> >> > Since we went down the tangent of explaining split-dwarf many emails ago, I've forgotten (& can't readily find) what we were discussing about the ways the imported_module could work.
>>>>>> >>> >> >
>>>>>> >>> >> > The simplest representation I can think of would be to have it reference, by signature, the module unit (whatever tag it uses) - DW_FORM_ref_sig8, seems the simplest thing to do.
>>>>>> >>> >> >
>>>>>> >>> >> >>
>>>>>> >>> >> >>   DW_TAG_partial_unit
>>>>>> >>> >> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/Foundation.pcm”)
>>>>>> >>> >> >>     DW_AT_dwo_id(“0x1234ABCDE”)
>>>>>> >>> >> >>
>>>>>> >>> >> >> 0x10:
>>>>>> >>> >> >
>>>>>> >>> >> > This is inside the partial unit? I figured we'd just put these attributes on the top level (compile_unit, or whatever it might be later) - potentially conditionalized on platform, sure.
>>>>>> >>> >> >
>>>>>> >>> >> >>     DW_TAG_module
>>>>>> >>> >> >>       DW_AT_name(“Foundation”)
>>>>>> >>> >> >>       DW_AT_LLVM_sysroot(“/“)
>>>>>> >>> >> >>       DW_AT_LLVM_include_dir(“”)
>>>>>> >>> >> >>       DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>>> >>> >> >>       ...
>>>>>> >>> >> >>
>>>>>> >>> >> >>
>>>>>> >>> >> >> ELF (comdat, with imports)
>>>>>> >>> >> >> --------------------------
>>>>>> >>> >> >>
>>>>>> >>> >> >> But now let’s go back to ELF. Since the skeleton with the partial unit is comdat'd, I assume that this breaks the FORM_ref_addr used in the DW_AT_import. We could reuse the module hash as a signature for the module:
>>>>>> >>> >> >>
>>>>>> >>> >> >> .debug_info:
>>>>>> >>> >> >>   DW_TAG_compile_unit
>>>>>> >>> >> >>     DW_AT_name(“foo.c”)
>>>>>> >>> >> >>     DW_TAG_imported_module
>>>>>> >>> >> >>       DW_AT_import(DW_FORM_ref_addr 0x1234ABCDE)
>>>>>> >>> >> >
>>>>>> >>> >> > Still only really need these imported_modules for lldb, right? I'd consider having them off-by-default for non-darwin, but I'm not strictly wedded to that notion. Wouldn't mind seeing size impact numbers of some kind - if it's really fractional % increase & GDB doesn't fall over when it sees them (in whatever FORM/tag/etc we decide on) then that's not the end of the world.
>>>>>> >>> >> >
>>>>>> >>> >> > Just seems nice if the default mode is the nice, standard, split-dwarf output. Doesn't need anything fancy.
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> >> .debug_info.dwo (group 0x1234ABCDE, comdat)
>>>>>> >>> >> >>   DW_TAG_partial_unit
>>>>>> >>> >> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/Foundation.pcm”)
>>>>>> >>> >> >>     DW_AT_dwo_id(“0x1234ABCDE”)
>>>>>> >>> >> >>
>>>>>> >>> >> >>     DW_TAG_module
>>>>>> >>> >> >>       DW_AT_signature(“0x1234ABCDE”)
>>>>>> >>> >> >>       DW_AT_name(“Foundation”)
>>>>>> >>> >> >
>>>>>> >>> >> >
>>>>>> >>> >> > The thing you haven't covered is the actual .dwo sections (.debug_info.dwo (we'll probably need a simple stub compile_unit to make this correct split-dwarf) and .debug_types.dwo being important - but all the supporting .dwo sections will be necessary) that go in the module file.
>>>>>> >>> >> >
>>>>>> >>> >> >> This is bending the definition of DW_AT_signature, but I guess it could be made to work. Or we could say that for now, users have to choose between the comdat optimization and having the module imports recorded in Dwarf, since GDB wouldn’t know what to do with that information anyway.
>>>>>> >>> >>
>>>>>> >>> >> Sorry for the long delay. Here’s a more complete example that should include all the suggestions made so far. For context I also included external type references in the example although admittedly this is a bit out of scope for this thread:
>>>>>> >>> >>
>>>>>> >>> >> ELF (typeunits, comdats, with imports)
>>>>>> >>> >> --------------------------------------
>>>>>> >>> >>
>>>>>> >>> >> On ELF or COFF a bar.c referencing type Foo from the module FooLib looks like this:
>>>>>> >>> >>
>>>>>> >>> >> bar.o
>>>>>> >>> >> ~~~~~
>>>>>> >>> >>
>>>>>> >>> >> // To keep this example focussed/readable, I'm assuming that bar.o itself was not compiled with fission.
>>>>>> >>> >> .debug_info:
>>>>>> >>> >>   DW_TAG_compile_unit
>>>>>> >>> >>     DW_AT_name(“bar.c”)
>>>>>> >>> >>     ...
>>>>>> >>> >>
>>>>>> >>> >>     DW_TAG_imported_module // <- This could be optional on ELF.
>>>>>> >>> >>       DW_AT_import [DW_FORM_ref_sig8] (0xABCD1234)
>>>>>> >>> >>
>>>>>> >>> >>     DW_TAG_variable
>>>>>> >>> >>       DW_AT_name(“MyFoo”)
>>>>>> >>> >>       DW_AT_type [DW_FORM_ref4] 0x20
>>>>>> >>> >> 0x20:
>>>>>> >>> >>     DW_TAG_structure_type
>>>>>> >>> >>       DW_AT_declaration (true)
>>>>>> >>> >>       DW_AT_signature [DW_FORM_ref_sig8] (0xF00)
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >> // Split DWARF skeleton CU for the module Foo.
>>>>>> >>> >>   DW_TAG_compile_unit
>>>>>> >>> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>>> >>> >>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>> >>> >>     ...
>>>>>> >>> >>
>>>>>> >>> >> // Comdat’d partial unit containing the optional module descriptor.
>>>>>> >>> >> .debug_info, group 0xABCD1234, comdat
>>>>>> >>> >>   DW_TAG_partial_unit
>>>>>> >>> >>     DW_TAG_module
>>>>>> >>> >>       DW_AT_name(“FooLib”)
>>>>>> >>> >>       DW_AT_LLVM_sysroot(“/“)
>>>>>> >>> >>       DW_AT_LLVM_include_dirs(“-I/path”)
>>>>>> >>> >>       DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>>> >>> >>       ...
>>>>>> >>> >>
>>>>>> >>> >> FooLib-XYZ.pcm
>>>>>> >>> >> ~~~~~~~~~~~~~~
>>>>>> >>> >>
>>>>>> >>> >> .debug_info.dwo
>>>>>> >>> >>   DW_TAG_compile_unit
>>>>>> >>> >>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>> >>> >>     ...
>>>>>> >>> >>
>>>>>> >>> >> // Type unit for the type Foo.
>>>>>> >>> >> .debug_types.dwo, group 0xF00, comdat
>>>>>> >>> >>   DW_TAG_type_unit
>>>>>> >>> >>     DW_TAG_structure_type
>>>>>> >>> >>       DW_AT_name (“Foo”)
>>>>>> >>> >>       ...
>>>>>> >>> >>
>>>>>> >>> >>
>>>>>> >>> >> I think it is awkward to have both the skeleton compile_unit in .debug_info and the partial_unit containing the TAG_module. Personally I’d prefer putting the TAG_module into the skeleton CU and then just referring to it via a FORM_ref_addr; but if we want to put the TAG_module into a comdat section, it looks like that’s what’s necessary.
>>>>>> >>> >
>>>>>> >>> > It's been a while & I've probably lost all the context, but I think my original theory was to have the skeleton compile_unit be comdat'd so they'd deduplicate on linking (so we'd only have one reference to the module.dwo in the linked binary). I don't recall there being a need for a separate partial_unit - I imagine we'd just put the LLDB/LLVM extension attributes on the skeleton compile_unit and expect debuggers that didn't understand them, to ignore them.
>>>>>> >>> >
>>>>>> >>> > Was there some reason this didn't work/make sense? Because you need a DW_TAG_module to import with DW_TAG_imported_module?
>>>>>> >>> Using DW_TAG_module was the best practice that was recommended on dwarf-discuss.
>>>>>> >>>
>>>>>> >>> Did they have any ideas on how to reference it without duplicating it in every CU?
>>>>>> >>
>>>>>> >> We didn’t touch the deduplication issue.
>>>>>> >>
>>>>>> >>> Once we've got the "Bag O Dwarf" stuff (rather than the narrower type units) this would be easier. I suppose we could do a partial solution/abuse of type units: use a type unit header (perhaps with Eric's merged type/compile unit work) and a DW_FORM_ref_sig8 value for the DW_AT_import in the DW_TAG_imported_module.
>>>>>> >>>
>>>>>> >>> Though I suppose if we're going to have DW_TAG_imported_module in every CU that references a module, it might not be that big of a deal to include the DW_TAG_module itself there too... while I don't care about this scheme immediately, Google's growing LLDB investment in various platforms, so I am vaguely concerned about getting this right & it's not immediately obvious to me what that right answer is.
>>>>>> >>
>>>>>> >> Maybe the best path forward is to stage this by initially putting the DW_TAG_module into the main CU and leave the deduplication as an optimization to be implemented once the bag’o dwarf is more fleshed out. This way we won’t do anything that would confuse consumers (assuming they ignore unknown tags) and the extra overhead is likely not even going to be noticeable, since all the string attributes inside the TAG_module can already be deduplicated by traditional means.
>>>>>> >
>>>>>> > Perhaps. I'd still like to think through/document what this looks like a bit more. Where the data ends up, what it's used for, etc. Sorry to draw this out.
>>>>>> >
>>>>>> > :/ *ponders*
>>>>>> 
>>>>>> 
>>>>>> Let’s construct this:
>>>>>> 
>>>>>> The most straightforward representation is to not unique the TAG_module and place it into the main CU.
>>>>>> 
>>>>>> bar.o
>>>>>> ~~~~~
>>>>>> 
>>>>>> .debug_info:
>>>>>>   DW_TAG_compile_unit
>>>>>>     ...
>>>>>>     DW_TAG_imported_module
>>>>>>       DW_AT_import [DW_FORM_ref4] (0x20)
>>>>>> 0x20:
>>>>>>     DW_TAG_module
>>>>>>       DW_AT_name(“FooLib”)
>>>>>>       DW_AT_LLVM_sysroot(“/“)
>>>>>>       DW_AT_LLVM_include_dirs(“-I/path”)
>>>>>>       DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>>> 
>>>>>> Might as well put all these LLVM attributes on the skeleton CU, though - so they can be deduplicated (& just put the dwo_id in this module somewhere, perhaps just using the DW_AT_dwo_id attribute - possibly that's the only attribute the DW_TAG_module would need, ideally). Unless we need to consider the submodule issue (in which case the skeleton unit would reference the whole module but the submodules would reference/describe the respective submodules?)?
>>>>> 
>>>>> We cannot put them into the skeleton CU if the skeleton CU is going to be comdat’d, because we’d then have to refer to it via a signature and that leads us directly to the can of worms discussed in the next paragraph :-)
>>>>>>  
>>>>>>       ...
>>>>>> 
>>>>>> // Split DWARF skeleton, comdat'd.
>>>>>> .debug_info, group 0xFEDB9876, comdat
>>>>>>   DW_TAG_compile_unit
>>>>>>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>>>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>>     ...
>>>>>> 
>>>>>> On Mach-O the split DWARF skeleton would not be comdat’d, but llvm-dsymutil can just ignore it.
>>>>>> 
>>>>>> 
>>>>>> If we want to dedup the TAG_module we need to refer to it via signature. This means we need to wrap it in a type_unit or a DWARF5 TAG_type_unit. We might as well throw it in with the skeleton CU.
>>>>>> 
>>>>>> .debug_info:
>>>>>>   DW_TAG_compile_unit
>>>>>>     ...
>>>>>>     DW_TAG_imported_module
>>>>>>       DW_AT_import [DW_FORM_ref_sig8] (0xABCD1234)
>>>>>> 
>>>>>> // Split DWARF skeleton, comdat'd.
>>>>>> .debug_info, group 0xFEDB9876, comdat
>>>>>>   DW_TAG_compile_unit
>>>>>>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>>>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>>     ...
>>>>>>     DW_TAG_type_unit (signature: 0xABCD1234)
>>>>>> 
>>>>>> Can't really put a type_unit inside a compile_unit - it'd need to be top-level with an appropriate type unit header, etc. & then we'd need two different units/headers, could still comdat them, but it's a weird abuse of type units & would probably confuse consumers. I don't know whether that's worth the effort.
>>>>> Oh right.
>>>>> 
>>>>>>  
>>>>>>       DW_TAG_module
>>>>>>         DW_AT_name(“FooLib”)
>>>>>>         DW_AT_LLVM_sysroot(“/“)
>>>>>>         DW_AT_LLVM_include_dirs(“-I/path”)
>>>>>>         DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>>>         ...
>>>>>> 
>>>>>> Now that raises the question about what happens with multiple modules within one PCM. 
>>>>>> 
>>>>>> Is the right term "submodule"? it's sort of confusing to talk about multiple modules within a pcm.
>>>>> 
>>>>> Yes, a module with nested submodules.
>>>>> http://clang.llvm.org/docs/Modules.html#submodule-declaration
>>>>> 
>>>>>>  
>>>>>> Assuming that the ELF linker is linking and deduping all the non-.dwo sections, we may lose some of the TAG_modules (if not every CU imports all submodules) in the binary, but that wouldn’t matter because the consumer would find all TAG_modules by signature in the .pcm.
>>>>>> 
>>>>>> Is there any reason we need to reference the submodules individually, rather than just reference the whole module
>>>>> 
>>>>> My assumption is that an AST-aware debugger will want to import the exact submodules that were imported by the CU before dropping into the expression evaluator to replicate the environment of the CU as much as possible.
>>>>> 
>>>>> I'm just not picturing that. It seems pretty likely that a debugger user is more likely to treat the whole set of names in the program, not just those syntactically valid at that point in the source file.
>>>> 
>>>> Module imports only work if the debugger has the precise list of modules imported by the current CU. Clang modules are not namespaces, and any two modules may conflict.
>>> 
>>> Right, as you say - ODR & C languages. (& I've no idea if file-scoped static/anonymous namespace things can go in C++ modules and what happens if you have conflicting modules in that regard - I guess they can conflict too? Dunno - maybe anon namespaces in C++ modules aren't allowed)
>> 
>> It sounds like a strange concept to put an anonymous namespace into a public module, but then again there exists clang/test/Modules/anon-namespace.cpp (it only uses an empty anonymous namespace, though). I’m not sure how this is meant to be used.
>> 
>>>>  
>>>> The cool thing is that with the imported modules the debugger effectively becomes clang and has the entire world visible to the current CU available, including any types and functions that never made it into the debug info because they were optimized out, or because they were uninstantiated templates that cannot be represented by DWARF.
>>>> 
>>>>> A simple example would be if I'm debugging LLVM and I'm in some generic optimization pass, but I want to cast my Instruction pointer to some specific instruction type to examine it in more detail - even though this pass doesn't care about that specific Instruction type nor include the header in which it's declared.
>>>> 
>>>> If, however, the type lookup fails, the debugger can still fall back to the traditional behavior, find the type in the accelerator tables and reconstruct it from DWARF (if it is there).
>>> 
>>> So you're going to need to implement fission (to at least some degree) support in LLDB, then? (to support the case where you haven't linked debug info with llvm-dsymutil, but you've hit one of these lookup problems where you need to cross possibly-conflicting modules)
>> 
>> Yes. Specifically, it won’t support type units, and it will look up types by name rather than by signature. (cf. the second part of http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150427/128278.html)
> 
> How are you going to reference the types in the module's fission CU without type units/signatures? Are you going to emit type declarations into the normal CU and rely on the debugger to know that these declarations can be resolved by looking elsewhere? (just without the benefit of constraining that search to just looking for a matching TU?)

If you look at the example in http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150427/128278.html, there will be an external type index (using the usual accelerator table format) that maps an external type’s UID to a pcm. In the pcm there is an extra accelerator table entry that maps UID to DIE offset.
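
To illustrate the two-level lookup described above, here is a minimal Python sketch. It is only a toy model: the dictionaries stand in for the accelerator tables, and the UID 0xF00 and the DIE offset 0x2A are made-up values for illustration, not anything the patch emits.

```python
# Hypothetical, simplified model of the two-level lookup.
# Level 1 (in the .o or linked binary): external type UID -> path of the .pcm.
# Level 2 (inside each .pcm): UID -> DIE offset within that .pcm's debug info.
# All keys and values below are invented for illustration.

external_type_index = {
    0xF00: "FooLib-XYZ.pcm",
}

pcm_accelerator_tables = {
    "FooLib-XYZ.pcm": {0xF00: 0x2A},
}


def resolve_external_type(uid):
    """Return (pcm_path, die_offset) for uid, or None if either level misses."""
    pcm_path = external_type_index.get(uid)
    if pcm_path is None:
        return None
    die_offset = pcm_accelerator_tables.get(pcm_path, {}).get(uid)
    if die_offset is None:
        return None
    return pcm_path, die_offset
```

A consumer that gets None back would fall back to the traditional path of looking the type up in the main DWARF.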

>>  
>> 
>>> 
>>> OK, so I think it's probably reasonable for now to just add DW_TAG_modules to the CU for each referenced module (or does it have to be each referenced submodule? (can two submodules within a single module be contradictory/conflicting?)). Since we don't have any good way to reference the module in a foreign unit while deduplicating that unit... there's not much point having the imported_module - but if you think it adds anything, I'm open to ideas.
>> It could help keep things simpler.
>> Emitting it doesn’t add much semantic value because module imports always occur at the top level, but it will make the transition to the deduplicated TAG_modules easier — It could be easier to teach consumers once about imported_module({ref to TAG_module}) rather than having them also recognize top-level TAG_modules as an intermediate step. It’s also slightly easier to implement in LLVM because the imported_module allows us to anchor the TAG_module in the CU, but that’s not a very strong argument.
> 
> Agreed on all counts (not a strong argument, but convenient enough, etc, etc).
> 
> I'm still not entirely sure what the right answer is here, though, which is why I'm hesitant to bake anything in too strongly.
> 
> To come back to one of the outstanding questions: Do you need submodule import information, or just module level (if modules cannot have internal conflicts and you can't avoid cross-module conflicts just by lack of visibility (I have no idea if either of those things are true) then you may just need per-module not per-submodule info)?

At the moment I do not think that it makes sense for two submodules to conflict, but there is nothing in the clang documentation that explicitly forbids this. With this in mind, I think it is reasonable to not support submodules (at least initially) and always emit an import for the parent module.
That’s what I wanted to write ... but as I’m browsing through our documentation, http://clang.llvm.org/docs/Modules.html#conflict-declarations explicitly gives an example of two conflicting submodules, so maybe this is not a reasonable simplification after all. On the other hand, a quick grep over all system module maps on OS X doesn’t show a single conflict declaration.

I still believe we do not need to support submodules right from the start, but we should have a story for getting there if we need to.
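
For anyone who wants to repeat the check for conflict declarations on their own module maps, a rough Python equivalent of the grep could look like the sketch below. The regular expression is an assumption based on the `conflict module-id, "message"` form shown in the documentation linked above, and the sample text is modelled on that documentation's example; it is not a real module map parser.

```python
import re

# Matches a line such as:  conflict B, "we just don't like B"
# This is a line-based approximation, not a full module map grammar.
CONFLICT_RE = re.compile(r'^\s*conflict\s+([\w.]+)\s*,\s*"([^"]*)"')


def find_conflicts(module_map_text):
    """Return a list of (module_id, message) for each conflict declaration."""
    results = []
    for line in module_map_text.splitlines():
        match = CONFLICT_RE.match(line)
        if match:
            results.append((match.group(1), match.group(2)))
    return results


# Sample input modelled on the example in the Clang modules documentation.
EXAMPLE = '''
module Conflicts {
  explicit module A {
    header "conflict_a.h"
    conflict B, "we just don't like B"
  }

  module B {
    header "conflict_b.h"
  }
}
'''
```

An empty result over all system module maps would match what the grep showed.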

> 
> Also, does each submodule need different special attributes/flags? If the special codegen attributes you want are at the module level, it'd probably be best to keep those on the Skeleton CU for the module (that will be comdat folded, etc, on ELF - and they could be DWARF-aware deduplicated by llvm-dsymutil) so they're not duplicated. The DW_TAG_module would then just have a DW_AT_signature attribute or something similarly small/trivial to point to the skeleton CU.

The attributes are derived from cc1 command line arguments. No two submodules imported by one CU can have different attributes. All submodules in a pcm also share their attributes. Putting them into the skeleton CU appears to be the most efficient place to put them, though perhaps not the most logical one.
I would prefer to stick the attributes on the (top-level) DW_TAG_module and later deduplicate the attributes together with the DW_TAG_module. Sticking them on the skeleton won’t save any space in the .o files and would save only 3*4-8=4 bytes (3x FORM_strp for include, macro, and isysroot, minus 1x FORM_ref_sig8) per CU and imported module.
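
Spelled out as a small Python sketch, the arithmetic is as follows. It assumes 32-bit DWARF, where a FORM_strp is a 4-byte offset, and the 8-byte FORM_ref_sig8 signature; the function names and the CU/import counts in the usage note are made up for illustration.

```python
FORM_STRP_SIZE = 4      # 32-bit DWARF: offset into the string section
FORM_REF_SIG8_SIZE = 8  # 64-bit signature


def bytes_saved_per_import(num_string_attrs=3):
    """Bytes saved by replacing the string attributes with one signature ref."""
    return num_string_attrs * FORM_STRP_SIZE - FORM_REF_SIG8_SIZE


def total_bytes_saved(num_cus, imports_per_cu):
    """Upper bound on the saving if every CU imports the same number of modules."""
    return num_cus * imports_per_cu * bytes_saved_per_import()
```

For example, a hypothetical program with 1000 CUs importing 10 modules each would save about 40000 bytes in total, which is small next to typical debug info sizes.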

> 
> If you need submodule import lists, then each DW_TAG_module representing a submodule would have a name (anything else?) and the signature referring to its module skeleton CU.

What I’m envisioning is:

.debug_info:
  DW_TAG_compile_unit
    ...
    DW_TAG_imported_module
      // import FooSubA
      DW_AT_import [DW_FORM_ref4] (0x60)

    DW_TAG_module
      DW_AT_name(“FooLib”)
      DW_AT_LLVM_sysroot(“/“)
      DW_AT_LLVM_include_dirs(“-I/path”)
      DW_AT_LLVM_macros(“-DNDEBUG”)
0x60:
      DW_TAG_module
        DW_AT_name(“FooSubA”)
        // need not be emitted if not referenced.
        DW_TAG_module
          DW_AT_name(“FooSubASubA”)

      // need not be emitted if not referenced.
      DW_TAG_module
        DW_AT_name(“FooSubB”)
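
As a rough illustration of how a consumer might turn the DW_AT_import reference above into a submodule name, here is a Python sketch. The ModuleDIE class and the qualified_name_at helper are hypothetical, only the 0x60 offset from the listing is modelled, and joining the names with dots is an assumption borrowed from how Clang spells submodule names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModuleDIE:
    """Toy stand-in for a DW_TAG_module DIE and its nested DW_TAG_modules."""
    name: str
    offset: Optional[int] = None  # only set for DIEs that are import targets here
    children: List["ModuleDIE"] = field(default_factory=list)


def qualified_name_at(root, offset, prefix=""):
    """Return the dotted module path of the DIE at offset, or None if absent."""
    path = f"{prefix}.{root.name}" if prefix else root.name
    if root.offset == offset:
        return path
    for child in root.children:
        found = qualified_name_at(child, offset, path)
        if found is not None:
            return found
    return None


# Mirrors the nesting in the listing above.
foo_lib = ModuleDIE("FooLib", children=[
    ModuleDIE("FooSubA", offset=0x60, children=[ModuleDIE("FooSubASubA")]),
    ModuleDIE("FooSubB"),
])
```

With this model, qualified_name_at(foo_lib, 0x60) yields "FooLib.FooSubA", which is the submodule the debugger would import before evaluating an expression.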

-- adrian
>  

>> 
>>> Maybe later (when we have Bag O' DWARF) we can do that. & only do this when targeting lldb (on by default on Darwin, off by default elsewhere).
>>> 
>>> & LLDB, once it's got the Fission support it'll need for this anyway, will fallback gracefully if these special modules are omitted.
>> 
>> Sounds good to me!
>> 
>> -- adrian
>> 
>>> 
>>> - David
>>>  
>>> 
>>>> 
>>>>>  (& have just a single, whole module in the pcm)?
>>>> 
>>>> That’s probably not what you meant, but just to be sure: The pcm will always have the entire module with all submodules in it. But the debugger may choose to import only a subset of those.
>>>> 
>>>>>  
>>>>> file referred to by whichever skeleton CU makes it into the binary:
>>>>> 
>>>>> FooLib-XYZ.pcm
>>>>> ~~~~~~~~~~~~~~
>>>>> 
>>>>> .debug_info.dwo
>>>>>  DW_TAG_compile_unit
>>>>>    DW_AT_dwo_id(“0xFEDB9876”)
>>>>>    ...
>>>>> 
>>>>>  DW_TAG_type_unit (signature: 0xABCD1234)
>>>>>    DW_TAG_module
>>>>>      DW_AT_name(“FooLib”)
>>>>>      ...
>>>>>  DW_TAG_type_unit (signature: 0xCDEF3456)
>>>>>    DW_TAG_module
>>>>>      DW_AT_name(“FooLib”)
>>>>>      DW_TAG_module
>>>>>        DW_AT_name(“SubFoo”)
>>>>>        ...
>>>>> 
>>>>> So.. this should work as long as nobody points out that a module isn’t really a type.
>>>>> 
>>>>> Yeah, probably worth waiting for "Bag O DWARF".
>>>>> 
>>>>> For now, as you mentioned earlier, maybe just putting the imported_module and the module into the compile_unit when tuning for LLDB (so Darwin by default, and anywhere else where someone tunes for LLDB in the future) & leave them out otherwise.
>>>> 
>>>> Sounds perfectly reasonable.
>>>>> 
>>>>> Could you remind me why LLDB wants to know which modules are referenced from a CU? (rather than just all the modules used by a program overall?)
>>>> 
>>>> LLDB uses clang for the expression evaluation. Traditionally it would look up a type in DWARF, build a clang AST out of it and then import it. With this it could directly import the clang modules and have access to everything in the module. But, clang modules are not namespaces, so modules can conflict (and that would probably manifest as a crash in libclang). 
>>>> 
>>>> What's an example of such a conflict? Is that valid (or is it just in ODR violations) - as mentioned above, it seems to me that only importing the things lexically available in this source file isn't what a debugger user would really want. I certainly think I'd trip over that a lot.
>>> 
>>> Keep in mind that Objective-C (and C) do not have an ODR, so it’s not just “just” :-)
>>> Being able to import modules does not mean that the debugger cannot still fall back to loading types from DWARF; in fact it will have to do that for all local types anyway.
>>> 
>>> -- adrian
>>> 
>>>>  
>>>> It therefore needs to know which modules are imported in the current CU before dropping into the expression evaluator.
>>>> 
>>>> - adrian
>>>> 
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mach-O, in the absence of comdats, we have:
>>>>> 
>>>>> bar.o
>>>>> ~~~~~
>>>>> 
>>>>> .debug_info:
>>>>>   DW_TAG_compile_unit
>>>>>     ...
>>>>>     DW_TAG_imported_module
>>>>>       DW_AT_import [DW_FORM_ref4] (0x20)
>>>>> 
>>>>>     DW_TAG_module           // uniqued by dsymutil.
>>>>>       DW_AT_name(“FooLib”)
>>>>>       DW_AT_LLVM_sysroot(“/“)
>>>>>       DW_AT_LLVM_include_dirs(“-I/path”)
>>>>>       DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>>       ...
>>>>> 
>>>>> // Split DWARF skeleton, thrown out by dsymutil.
>>>>> 
>>>>> Thrown out? Because it's going to read everything in from the module and merge it in to a single linked debug info blob, I take it?
>>>>>  
>>>>> .debug_info, group 0xFEDB9876, comdat
>>>>>   DW_TAG_compile_unit
>>>>>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>     ...
>>>>> 
>>>>> FooLib-XYZ.pcm
>>>>> ~~~~~~~~~~~~~~
>>>>> 
>>>>> .debug_info:
>>>>>   DW_TAG_compile_unit
>>>>>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>>     ...
>>>>> 
>>>>>     DW_TAG_module
>>>>>       DW_AT_name(“FooLib”)
>>>>>       DW_TAG_module
>>>>>         DW_AT_name(“SubFoo”)
>>>>>         ...
>>>>> 
>>>>> -- adrian
>>>>> 
>>>>> >
>>>>> >>
>>>>> >>>
>>>>> >>> > If it turns out that's the right way to get a target for the imported_module, we could put both the skeleton CU and the partial unit in the same comdat and dedup them both together.
>>>>> >>>
>>>>> >>> I think this works as long as we only have one TAG_module per .pcm file (because we need to refer to it via signature).
>>>>> >>>
>>>>> >>> Not quite following here - why would we have more than one module per pcm - a pcm is a module, right?
>>>>> >>
>>>>> >> Clang modules may have submodules and a compile unit could import two submodules that live in the same .pcm file. For example on Darwin there is a module Darwin.pcm that contains a submodule “C” that contains the submodule “stdio”.
>>>>> >
>>>>> > OK, so this bit's relevant to your use case in LLDB of loading the right things for the right context, but not relevant to the context-less debuggers like GDB that will just treat everything as one big namespace (except for file-local things, etc). So it's important for your imported modules but not for the basic Fission style debug reference.
>>>>> >
>>>>> > Well, maybe - I'm not sure what you're picturing in terms of the DWARF in the module for submodules? If you want that granularity we'll have to talk about how to split the DWARF in the module into chunks per submodule?
>>>>> >
>>>>> >>
>>>>> >>>
>>>>> >>> But if we don’t mind having duplicate dwo_* references in the same .o file this would also work with more than one TAG_module (or submodules).
>>>>> >>>
>>>>> >>>
>>>>> >>> .debug_info:
>>>>> >>>  DW_TAG_compile_unit
>>>>> >>>    DW_AT_name(“bar.c”)
>>>>> >>>    ...
>>>>> >>>
>>>>> >>>    DW_TAG_imported_module // <- This could be optional on ELF.
>>>>> >>>      DW_AT_import [DW_FORM_ref_sig8] (0xFEDB9876)
>>>>> >>>
>>>>> >>>    ...
>>>>> >>>
>>>>> >>> // Comdat’d split DWARF skeleton CU for the module Foo.
>>>>> >>> .debug_info, group 0xFEDB9876, comdat
>>>>> >>>  DW_TAG_compile_unit
>>>>> >>>    DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>> >>>    DW_AT_dwo_id(“0xFEDB9876”)
>>>>> >>>    ...
>>>>> >>>
>>>>> >>>    DW_TAG_module
>>>>> >>>      DW_AT_name(“FooLib”)
>>>>> >>>      DW_AT_LLVM_sysroot(“/”)
>>>>> >>>      DW_AT_LLVM_include_dirs(“-I/path”)
>>>>> >>>      DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>> >>>      ...
>>>>> >>>
>>>>> >>>
>>>>> >>> >
>>>>> >>> > But this gets into complicated territory when the original binary is built with fission... which will be relevant for modules on ELF with LLDB. Hmm, maybe it's not too complicated - the partial_unit would end up in the .dwo file (maybe we'd have to teach the .dwo file to deduplicate these too - the same way it does for type units... - might require a new header to include the hash, etc :/)... would be tricky to have the dwp tool resolve the relocations to these things. Cross-unit references as you've got there aren't something that every DWARF consumer is totally cool with, I don't think?
>>>>> >>>
>>>>> >>> Ah. I thought the deduplication happens because all ELF sections sharing the same group are uniqued based on the group id.
>>>>> >>>
>>>>> >>> COMDAT groups deduplicate for a normal non-fission build, but fission linking doesn't require the .dwo file to use/contain COMDATs as it uses a DWARF-aware tool (so you don't bother putting the type units in COMDAT groups, for example - the fission linker knows how to parse debug_types, find the type unit headers and their hashes and deduplicates them that way).
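
A minimal sketch in Python of the signature-based deduplication described above, assuming each .dwo input has already been parsed into a list of type units. The `TypeUnit` class and `dedup_type_units` helper are hypothetical names for illustration, not part of any real fission linker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeUnit:
    signature: int  # 8-byte type signature read from the type unit header
    payload: bytes  # stand-in for the unit's DIE contents (assumption)


def dedup_type_units(dwo_files):
    """Keep the first type unit seen for each signature across all inputs."""
    seen = {}
    for units in dwo_files:
        for tu in units:
            # A later unit with a signature we already hold is a duplicate.
            seen.setdefault(tu.signature, tu)
    return list(seen.values())


# Two hypothetical .dwo inputs that both define the type with signature 0xFEDB9876.
a_dwo = [TypeUnit(0xFEDB9876, b"struct Foo"), TypeUnit(0x1111, b"struct Bar")]
b_dwo = [TypeUnit(0xFEDB9876, b"struct Foo")]
merged = dedup_type_units([a_dwo, b_dwo])
```

No COMDAT group is involved here: the signature in the unit header is the only key, which is why a module skeleton or partial unit would need a comparable hash in its header to be deduplicated the same way.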
>>>>> >>
>>>>> >> Ok that makes sense.
>>>>> >>
>>>>> >> -- adrian
>>>>> >>
>>>>> >>>
>>>>> >>> It certainly would be nice if we could avoid introducing a new .debug_info header...
>>>>> >>>
>>>>> >>> >
>>>>> >>> > Sort of inclined to have the imported module stuff just for LLDB, but I've lost some of the context for that in the ensuing weeks.
>>>>> >>>
>>>>> >>> -- adrian
>>>>> >>>
>>>>> >>> >
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >> MachO (no typeunits, no comdats, with imports)
>>>>> >>> >> ----------------------------------------------
>>>>> >>> >>
>>>>> >>> >> Since we don’t have comdat sections in Mach-O and we don’t have the tool support for type units, the way that external types can be referenced necessarily needs to be a bit different. The design that Greg and I came up with for Mach-O relies on llvm-dsymutil to fix up the DWARF for non-module-aware consumers. Just as ELF DWARF consumers need not be able to tell the difference between module debugging and split DWARF, on Mach-O the .dSYM bundle generated by llvm-dsymutil looks like traditional DWARF.
>>>>> >>> >>
>>>>> >>> >> There are three differences in the DWARF output that make this possible:
>>>>> >>> >>   - Refer to external types by UID rather than by type signature.
>>>>> >>> >>     (This doubles as the key that allows a debugger to look up and import the type
>>>>> >>> >>      directly from the AST, and it protects us against hash collisions)
>>>>> >>> >>   - Add an index to the .o file that maps UID -> module file.
>>>>> >>> >>     (Fast lookup + UIDs for C and ObjC are only unique within a module)
>>>>> >>> >>   - Add an entry for each type’s UID to the types accelerator table.
>>>>> >>> >>     (Fast lookup)
>>>>> >>> >>
>>>>> >>> >> bar.o
>>>>> >>> >> ~~~~~
>>>>> >>> >>
>>>>> >>> >> .debug_info:
>>>>> >>> >>   DW_TAG_compile_unit
>>>>> >>> >>     DW_AT_name(“bar.c”)
>>>>> >>> >>     DW_TAG_imported_module
>>>>> >>> >>       DW_AT_import(DW_FORM_ref_addr 0x40)
>>>>> >>> >>
>>>>> >>> >>     DW_TAG_variable
>>>>> >>> >>       DW_AT_name(“MyFoo”)
>>>>> >>> >>       DW_AT_type [DW_FORM_strp] (“_ZTS3Foo”)  // We could use a custom FORM here
>>>>> >>> >>
>>>>> >>> >>   // Skeleton unit.
>>>>> >>> >>   DW_TAG_compile_unit
>>>>> >>> >>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm”)
>>>>> >>> >>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>> >>> >>     ...
>>>>> >>> >> 0x40:
>>>>> >>> >>     DW_TAG_module
>>>>> >>> >>       DW_AT_name(“FooLib”)
>>>>> >>> >>       DW_AT_LLVM_sysroot(“/”)
>>>>> >>> >>       DW_AT_LLVM_include_dirs(“-I/path”)
>>>>> >>> >>       DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>> >>> >>
>>>>> >>> >> // This index uses the usual accelerator table format.
>>>>> >>> >> .apple_exttypes:
>>>>> >>> >> { “_ZTS3Foo” => debug_str offset of “/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm” }
>>>>> >>> >>
>>>>> >>> >> FooLib-XYZ.pcm
>>>>> >>> >> ~~~~~~~~~~~~~~
>>>>> >>> >>
>>>>> >>> >> .debug_info
>>>>> >>> >>   DW_TAG_compile_unit
>>>>> >>> >>     DW_AT_dwo_id(“0xFEDB9876”)
>>>>> >>> >>
>>>>> >>> >> 0x80:
>>>>> >>> >>   DW_TAG_structure_type
>>>>> >>> >>     DW_AT_name (“Foo”)
>>>>> >>> >>     DW_AT_signature
>>>>> >>> >>     ...
>>>>> >>> >>
>>>>> >>> >> // In addition to the entry for “Foo”, there is also an entry for the type’s UID “_ZTS3Foo” pointing to the type definition DIE.
>>>>> >>> >> .apple_types
>>>>> >>> >> { “Foo” => 0x80 }
>>>>> >>> >> { “_ZTS3Foo” => 0x80 }
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >> When the debug info linker (llvm-dsymutil) is run, it first pulls in the .debug_info section from the clang module and fixes up all the DW_FORM_strp external type references by turning them into a DW_FORM_ref_addr that references the type in the DW_TAG_compile_unit pulled in from the module. To find the correct type DIE, it looks up the UID in the .apple_exttypes index, finds the module, looks up the UID in the module’s regular .apple_types accelerator table, and replaces the temporary DW_FORM_strp with a DW_FORM_ref_addr (which incidentally takes up the same amount of space in the DIE).
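
The fix-up pass described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the tables are plain dictionaries rather than real accelerator tables, and `resolve_external_types` and `module_base` (the offset at which the module's compile unit lands in the linked .debug_info) are hypothetical names, not llvm-dsymutil APIs.

```python
PCM = "/tmp/org.llvm.clang/ModuleCache/1234ABCDE/FooLib-XYZ.pcm"

# .apple_exttypes in bar.o: type UID -> module file.
exttypes = {"_ZTS3Foo": PCM}

# Per-module .apple_types: name or UID -> DIE offset inside the module.
modules = {PCM: {"apple_types": {"Foo": 0x80, "_ZTS3Foo": 0x80}}}


def resolve_external_types(dies, exttypes, modules, module_base):
    """Rewrite ("strp", uid) type attributes into ("ref_addr", offset)."""
    fixed = []
    for die in dies:
        form, value = die["type"]
        if form == "strp" and value in exttypes:
            pcm = exttypes[value]                       # UID -> module
            local = modules[pcm]["apple_types"][value]  # UID -> DIE in module
            # Both forms are offset-sized, so the DIE does not change size.
            die = {**die, "type": ("ref_addr", module_base[pcm] + local)}
        fixed.append(die)
    return fixed


dies = [{"name": "MyFoo", "type": ("strp", "_ZTS3Foo")}]
module_base = {PCM: 0x1000}  # assumed placement of the module's CU
linked = resolve_external_types(dies, exttypes, modules, module_base)
```

With the assumed base of 0x1000, the reference for “MyFoo” ends up pointing at 0x1080, i.e. the module-local DIE offset 0x80 rebased into the linked .debug_info.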
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >> Thoughts?
>>>>> >>> >> --
>>>>> >>> >> adrian
>>>>> >>> >>
>>>>> >>> >
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>> 
>>> 
>> 
>> 
>