[PATCH] Have clang list the imported modules in the debug info

Mon Mar 16 13:35:41 PDT 2015

On Mon, Mar 16, 2015 at 1:24 PM, Adrian Prantl <aprantl at apple.com> wrote:

>
> On Mar 10, 2015, at 12:10 PM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
>
> On Tue, Mar 10, 2015 at 12:05 PM, Adrian Prantl <aprantl at apple.com> wrote:
>
>>
>> On Mar 9, 2015, at 5:16 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>
>>
>>
>> On Mon, Mar 9, 2015 at 5:07 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>>
>>> On Mar 9, 2015, at 2:14 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>
>>>
>>>
>>> On Mon, Mar 9, 2015 at 1:52 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>
>>>>
>>>> On Feb 24, 2015, at 3:06 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>> On Tue, Feb 24, 2015 at 2:56 PM, Adrian Prantl <aprantl at apple.com>
>>>> wrote:
>>>>
>>>>> On Feb 24, 2015, at 2:36 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>>
>>>>> On Mon, Feb 23, 2015 at 3:45 PM, Adrian Prantl <aprantl at apple.com>
>>>>> wrote:
>>>>>
>>>>>> On Feb 23, 2015, at 3:37 PM, David Blaikie <dblaikie at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> On Mon, Feb 23, 2015 at 3:32 PM, Adrian Prantl <aprantl at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Feb 23, 2015, at 3:14 PM, David Blaikie <dblaikie at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Mon, Feb 23, 2015 at 3:08 PM, Adrian Prantl <aprantl at apple.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Feb 23, 2015, at 2:59 PM, David Blaikie <dblaikie at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Mon, Feb 23, 2015 at 2:51 PM, Adrian Prantl <aprantl at apple.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> > On Jan 20, 2015, at 11:07 AM, David Blaikie <dblaikie at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > My vague recollection from the previous design discussions was
>>>>>>>>> that these module references would be their own 'unit' COMDAT'd so that we
>>>>>>>>> don't end up with the duplication of every module reference in every unit
>>>>>>>>> linked together when linking debug info?
>>>>>>>>> >
>>>>>>>>> > I think in my brain I'd been picturing this module reference as
>>>>>>>>> being an extended fission reference (fission skeleton CU + extra fields for
>>>>>>>>> users who want to load the Clang AST module directly and skip the split CU).
>>>>>>>>>
>>>>>>>>> Apologies for letting this rest for so long.
>>>>>>>>>
>>>>>>>>> Your memory was of course correct and I didn’t follow up on this
>>>>>>>>> because I had convinced myself that the fission reference would be
>>>>>>>>> completely sufficient. Now that I’ve been thinking some more about it, I
>>>>>>>>> don’t think that it is sufficient in the LTO case.
>>>>>>>>>
>>>>>>>>> Here is the example from the
>>>>>>>>> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-November/040076.html
>>>>>>>>> :
>>>>>>>>>
>>>>>>>>> foo.o:
>>>>>>>>> .debug_info.dwo
>>>>>>>>>   DW_TAG_compile_unit
>>>>>>>>>      // For DWARF consumers
>>>>>>>>>      DW_AT_dwo_name ("/path/to/module-cache/MyModule.pcm")
>>>>>>>>>      DW_AT_dwo_id   ([unique AST signature])
>>>>>>>>>
>>>>>>>>> .debug_info
>>>>>>>>>   DW_TAG_compile_unit
>>>>>>>>>     DW_TAG_variable
>>>>>>>>>       DW_AT_name "x"
>>>>>>>>>       DW_AT_type (DW_FORM_ref_sig8) ([hash for MyStruct])
>>>>>>>>>
>>>>>>>>> In this example it is clear that foo.o imported MyModule because
>>>>>>>>> its DWO skeleton is there in the same object file. But if we deal with the
>>>>>>>>> result of an LTO compilation we will end up with many compile units in the
>>>>>>>>> same .debug_info section, plus a bunch of skeleton compile units for _all_
>>>>>>>>> imported modules in the entire project. We thus loose the ability to
>>>>>>>>> determine which of the compile units imported which module.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why would we need to know which CU imported which modules? (I can
>>>>>>>> imagine some possible reasons, but wondering what you have in mind)
>>>>>>>>
>>>>>>>>
>>>>>>>> When the debugger is stopped at a breakpoint and the user wants to
>>>>>>>> evaluate an expression, it should import the modules that are available at
>>>>>>>> this location, so the user can write the expression from within the context
>>>>>>>> of the breakpoint (e.g., without having to fully qualify each type, etc).
>>>>>>>>
>>>>>>>
>>>>>>> I'm not sure how much current debuggers actually worry about that -
>>>>>>> (& this may differ from lldb to gdb to other things, of course). I'm pretty
>>>>>>> sure at least for GDB, a context in one CU is as good as one in another (at
>>>>>>> least without split-dwarf, type units, etc - with those sometimes things
>>>>>>> end up overly restrictive as the debugger won't search everything properly).
>>>>>>>
>>>>>>> eg: if you have a.cpp: int main() { }, b.cpp: void func() { } and
>>>>>>> you run 'start' in gdb (which breaks at the beginning of main) you can
>>>>>>> still run 'p func()' to call the func, even though there's no declaration
>>>>>>> of it in a.cpp, etc.
>>>>>>>
>>>>>>>
>>>>>>> LLDB would definitely care (as it is using clang for the expression
>>>>>>> evaluation supporting these kinds of features is really straightforward
>>>>>>> there). By importing the modules (rather than searching through the DWARF),
>>>>>>> the expression evaluator gains access to additional declarations that are
>>>>>>> not there in the DWARF, such as templates. But since clang modules are not
>>>>>>> namespaces, we can’t generally "import the world” as a debugger would
>>>>>>> usually do.
>>>>>>>
>>>>>>
>>>>>> Sorry, not sure I understand this last sentence - could you explain
>>>>>> further?
>>>>>>
>>>>>> I imagine it would be rather limiting for the user if they could only
>>>>>> use expressions that are valid in this file from the file - it wouldn't be
>>>>>> uncommon to want to call a function from another module/file/etc to aid in
>>>>>> debugging.
>>>>>>
>>>>>>
>>>>>> Usually LLDB’s expression evaluator works by creating a clang AST
>>>>>> type out of a DWARF type and inserting it into its AST context. We could
>>>>>> pre-polulate it with the definitions from the imported modules (with all
>>>>>> sorts of benefits as described above), but that only works if no two
>>>>>> modules conflict. If the declaration can’t be found in any imported module,
>>>>>> LLDB would still import it from DWARF in the “traditional” fashion.
>>>>>>
>>>>>
>>>>> But it would import it from DWARF in other TUs rather than use the
>>>>> module info just because the module wasn't directly referenced from this
>>>>> TU? That would seem strange to me. (you would lose debug info fidelity (by
>>>>> falling back to DWARF even though there are modules with the full fidelity
>>>>> info) unnecessarily, it sounds like)
>>>>>
>>>>>
>>>>> I think it’s reasonable to expect full fidelity for everything that is
>>>>> available in the current TU, and having the normal DWARF-based debugging
>>>>> capabilities for everything beyond that. But we can only ever provide full
>>>>> fidelity if we have the list of imports for the current TU.
>>>>>
>>>>>
>>>>> Would it be reasonable to use the accelerator table/index to lookup
>>>>> the types, then if the type is in the module you could use the module
>>>>> rather than the DWARF stashed alongside it? (so the comdat'd split-dwarf
>>>>> skeleton CU for the module would have an index to tell you what names are
>>>>> inside it, but if you got an index hit you'd just look at the module
>>>>> instead of loading the split-dwarf debug info in the referenced file)
>>>>>
>>>>>
>>>>> I don’t think this approach would work for templates and enumerator
>>>>> values;
>>>>>
>>>>
>>>> Not sure why enumerator values are an issue - but templates (& all
>>>> manner of other things that don't make it into the index, unfortunately),
>>>> sure.
>>>>
>>>>
>>>>> they aren’t in the accelerator tables to begin with. It would also be
>>>>> slower if the declaration is available in a module.
>>>>>
>>>>
>>>> Though you're rapidly going to end up loading a lot of modules in (as
>>>> you go up & down a stack printing various things you'll cross into other
>>>> TUs & load more modules).
>>>>
>>>> For a standard DWARF consumer, it seems fine to just have a comdat'd
>>>> skeleton CU for a module without the need for other CUs to mention which
>>>> module CUs they reference (but I could be wrong here) & that's the design
>>>> we originally discussed.
>>>>
>>>> It would seem unfortunate to bloat every CU with a non-deduplicable
>>>> list of every module it references, but if that's necessary for a
>>>> serialized AST aware debugger, it might be fine to have it as an option (so
>>>> long as it can be turned off) & may still benefit from that list not being
>>>> the authoritative module reference, but a /very/ terse reference to it so
>>>> all the extra flags & stuff can be in the deduplicable comdat (& to keep it
>>>> as consistent as possible between the flag (on/off) codepaths for this
>>>> extra data). Maybe a FORM_block (?) of fixed-size hashes of all the modules
>>>> back-to-back, so it's as small as possible?
>>>>
>>>> But I wouldn't mind spending some more time discussing whether there's
>>>> a better way to keep these things streamlined/symmetric/the same between
>>>> modular and non-modular debug info.
>>>>
>>>> Sure!
>>>> Now that we established that recording the list of imported modules for
>>>> every CU is useful for an AST-based debugger,
>>>>
>>>
>>> +Richard, just to see if he's got some ideas about how a debugger might
>>> efficiently use modules to support debugger scenarios and whether or not
>>> having a list of which modules are referenced from which contexts is
>>> valuable in that.
>>>
>>> It still concerns me that this would create something of a
>>> regression/oddity/difference between AST-based debug info (you wouldn't be
>>> able to handle expressions referencing things in other TUs) and non-AST
>>> based debug info (where I think the average user is used to not worrying
>>> about what headers are included in the current file they're debugging when
>>> they try to use a type or other identifier)
>>>
>>>
>>> If I understood you correctly, this is not actually the case. The list
>>> of imported modules allows the AST-based debugger to import all the modules
>>> that were imported by the CU that the current frame is in. This enables the
>>> user to, e.g., type "p myVector->size()" even though
>>> std::vector<MyClass>::size() was not used by the CU and is thus not
>>> available in DWARF.
>>>
>> If the user types “p foo” even though foo was not defined in any imported
>>> module the debugger can — after failing to import foo via clang — still
>>> fall back to looking up foo in DWARF and do what it always did.
>>>
>>
>> If you do the DWARF fallback then you'll get a pretty clear inconsistency
>> between templates and non-templates. If I have a function foo and a
>> function template foo_tmpl in one file, and I'm debugging in another file
>> I'll be able to call 'foo' (normal DWARF fallback/search) but not foo_tmpl
>> (if I'm calling a new instantiation of foo_tmpl - if I'm calling an
>> existing instantiation presumably the fallback would catch me). Seems
>> unfortunate/confusing, perhaps.
>>
>>
>> Good point, but it my guess is that this wouldn’t be any worse than the
>> “why can’t I print the size() of this vector!?”-situations we have at the
>> moment.
>>
>
> Sure - it's strictly better in the sense that there are strictly more
> expressions that can be evaluated, but seems incomplete is my point, and
> maybe worth considering alternative designs that might be more-betterer.
>
>
>> In certain situations (i.e., non-templates) the debugger could use the
>> DWARF in the modules to print a message about which module to import.
>>
>>
>>
>>>
>>>
>>>
>>>> let’s talk about how to most efficiently represent this information.
>>>>
>>>> In the CU, using DW_TAG_imported_module appears to be the most
>>>> appropriate choice, even though there is some room for confusion since C++
>>>> using declarations are also represented this way. Inside the
>>>> DW_TAG_imported_module, we could use
>>>> (1) a DW_AT_import that references the skeleton (I hope that is the
>>>> right terminology) CU for the module, the idea being that the skeleton CU
>>>> would contain all the details (flags, name, include dirs, hash, ...) and be
>>>> in a comdat'ed section.
>>>>
>>>
>>> I'd be concerned about overloading the terminology & confusing other
>>> debuggers - they might try to follow the DW_AT_import and be surprised that
>>> it doesn't refer to a DW_TAG_namespace tag.
>>>
>>>
>>> That’s a valid concern, and we probably should not be emitting this if
>>> we have any evidence of, e.g., gdb crashes when encountering such a
>>> construct. Then again, we would be using a DW_TAG_imported_module to
>>> express what it is meant to express according to the DWARF spec (namely
>>> importing a module)... but I admit that the tag also does have a very
>>> specific meaning for C++, which we maybe shouldn’t overload.
>>>
>>
>> That's my concern, yes.
>>
>>
>>> The right thing here is probably to put aside my personal sense of
>>> aesthetics and use a private _LLVM_ namespace for all new additions, and
>>> then attempt to standardize an official DWARF version once we know what is
>>> really needed and what isn't.
>>>
>>
>> I'd prefer this, yes. I mean the usual bar we use for language features
>> is that they're at least proposed for standardization before we adopt them
>> in clang - I wouldn't mind a similar bar here. If you want to bring up this
>> use of DW_TAG_imported_module with the DWARF committee & see if it sounds
>> reasonable (& test/inquire about GDB's behavior here).
>>
>>
>> I started a thread on dwarf-discuss to this end (
>> http://lists.dwarfstd.org/listinfo.cgi/dwarf-discuss-dwarfstd.org, the
>> list archives are only visible to subscribers, but anyone can subscribe).
>>
>
> Cool cool
>
>
> To paraphrase the replies that my question solicited: We are, perhaps not
> very surprising, encouraged to follow the standard and use a
> DW_TAG_imported_module that references a DW_TAG_module. If we, however,
> choose to describe the module by using a skeleton DW_TAG_compile_unit, we
> should be careful (my own words) about using a DW_TAG_imported_module until
> that use is sanctioned by the standard.
>
> I see two possible ways to proceed in this spirit:
> a) Rename the module skeleton DW_TAG_compile_units to DW_TAG_module, but
> keep all the comdat/split dwarf goodness from the original proposal [1]. My
> understanding is that even though we are making clever use of the split
> DWARF features, GDB would still need to be taught to follow references to
> external files,
>

Not sure what you're referring to here, perhaps a misunderstanding about
how split DWARF works.

To the best of my knowledge, what we've talked about for module DWARF debug
info is actually just split-dwarf, no extra work required by DWARF
consumers*.

* It's, admittedly, a little tricksy to include type unit references in an
object file that doesn't include the type unit at all - relying on it being
linked into the final executable. But DWARF doesn't really talk about
objects versus executables, etc - so, so long as the type unit is there in
the end, it's valid DWARF no matter how it got there (& should work fine
for existing consumers - they can't tell if the type unit was in every
object file that referenced the type or not once it's been linked and
deduplicated).

> so having it recognize a new tag in this context doesn’t appear to be much
> additional effort (but others may provide more insight here).
>

Beyond the above (that using a new tag would mean this would go from 'free'
to 'not free' for GDB) having a new top level tag is pretty substantial (we
only have two at the moment, and with our talk of modules being a "bag of
dwarf" might go back to having one top level tag? (it's not clear to me
from DWARF4 whether DW_TAG_module is currently a top-level tag, I don't
think it is?)

> b) Emit an LLVM-specific DW_AT_LLVM_import attribute inside the
> DW_TAG_imported_module (or vice versa) that refers to the skeleton
> DW_TAG_compile_unit.
>
> I think that option (a) is a bit more elegant and it is bending the dwarf
> standard not quite as much and will make the dwarf output a bit more
> readable.
>
> -- adrian
>
> [1] Module debugging proposal for reference:
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-November/040076.html
>
>
> - David
>
>
>>
>> -- adrian
>>
>>
>> But extension tags seems like the conservatively correct option (not sure
>> what GDB does on tags it doesn't recognize - I forget if it warns or just
>> completely ignores them, hopefully the latter)
>>
>>>
>>>
>>>> (2) David’s suggestion of using a custom form that records the module
>>>> hash directly is quite space-efficient, but it has the drawback of not
>>>> being resilient against small changes to the imported module
>>>>
>>>
>>> That's going to be true of the normal fission info here (the skeleton CU
>>> and the full CU in the .dwo file (or module) are associated by hash) -
>>> granted, in the "loading an AST" mode, you can ignore those hashes and rely
>>> on your custom attributes instead.
>>>
>>>
>>>> , since clang’s module hash changes each time the module is being
>>>> rebuilt.
>>>>
>>>
>>> Clang's module hash only changes if the DWARF contents change - it
>>> doesn't use a timestamp or anything. It seems like actually you're going to
>>> want to fail to load even more aggressively - there are ways the AST
>>> might've changed that the debug info doesn't reflect but are still
>>> important (a type unreferenced in this module, but built into some other
>>> code that is not built with debug info changes - no hash changes because
>>> the debug info for that type is unreferenced here, but if you try to use it
>>> you could have an incompatible layout, etc).
>>>
>>>
>>> Agreed: If the module contents changed the debugger needs to display a
>>> big flashing "here be dragons" warning.
>>>
>>>
>>>
>>>> This is less of an issue if the hash is referring to a skeleton CU in
>>>> the same file, which contains all the detailed information.
>>>>
>>>> Personally I’d prefer option 1 because mostly uses the existing
>>>> mechanisms from DWARF. Here’s a visual guide to the options on the table:
>>>>
>>>> (1)
>>>> foo.o (compiled with, let’s call it .. "-gmodule-imports”)
>>>> -----
>>>> .debug_info:
>>>>   DW_TAG_compile_unit
>>>>     DW_AT_name(“foo.c”)
>>>>     DW_TAG_imported_module
>>>>       DW_AT_import(DW_FORM_ref_addr 0x123)  // Could be a FORM_ref_sig8
>>>> 0x1234ABCDE as well.
>>>>     DW_TAG_imported_module
>>>>       DW_AT_import(...)
>>>>
>>>> .debug_info.dwo:
>>>> // Skeleton CUs for modules imported by foo.o.
>>>> 0x123:
>>>>   DW_TAG_compile_unit
>>>>     // Used by split-dwarf debuggers to find external type definitions.
>>>>     DW_AT_dwo_name(“/tmp/org.llvm.clang/ModuleCache/1234ABCDE/
>>>> Foundation.pcm”)
>>>>     DW_AT_dwo_id(“0x1234ABCDE”)
>>>>
>>>>     // Used by AST-based debuggers to import the module.
>>>>     DW_AT_name(“Foundation”)
>>>>
>>>
>>> (side notes: the mixed indentation here makes it a bit hard to read this
>>> example, and I'd make sure /all/ the extended attributes (including the
>>> name here) use custom attribute names, not standard ones)
>>>
>>>
>>> Agreed.
>>>
>>>
>>>
>>>>     DW_AT_LLVM_sysroot(“/“)
>>>>     DW_AT_LLVM_include_dir(“”)
>>>>     DW_AT_LLVM_macros(“-DNDEBUG”)
>>>>
>>>> (2)
>>>> .debug_info.dwo:
>>>> (As above.)
>>>>
>>>> .debug_info:
>>>>   DW_TAG_compile_unit
>>>>     DW_AT_name(“foo.c”)
>>>>     DW_AT_LLVM_imported_modules(DW_FORM_block 0x1234ABCDE 0xDEADBEEF
>>>> 0x....)
>>>>
>>>> Now I’m curious what option (3) will look like; the one that we’ll
>>>> actually implement!
>>>>
>>>
>>> ;)
>>>
>>>
>>>
>>> -- adrian
>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20150316/7713866a/attachment.html>