[PATCH] Have clang list the imported modules in the debug info

Mon Aug 24 14:01:18 PDT 2015

On Mon, Aug 24, 2015 at 1:23 PM, Adrian Prantl <aprantl at apple.com> wrote:

>
> On Aug 19, 2015, at 1:20 PM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
>
> On Mon, Aug 10, 2015 at 5:00 PM, Adrian Prantl <aprantl at apple.com> wrote:
>
>>
>> On Jul 24, 2015, at 12:33 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>
>> *reads back through the thread*
>>
>>
>> appreciated, it’s long :-)
>>
>> So what I originally had in mind about a year ago when we discussed this,
>> was that the module data could have an extra table from type hash to
>> whatever useful internal representation to find the type in the PCM.
>>
>>
>> It turned out that the most useful internal type representation to find a
>> type in a PCM is the type’s DeclContext+Name; this is how (surprise!) clang
>> looks up types in a PCM and the format is supposed to be fast for these
>> kind of lookups.
>>
>
> Still, I would imagine there would be some kind of direct access (the
> offset in the file, or somesuch) rather than actually having to go through
> hashtables, etc. No? (how does one module refer to types in another module?
> Really by name?)
>
>
> Entities in PCMs have local integer IDs (just consecutive numbers) that
> are used to encode references inside a record on disk. An external
> reference gets a global ID which is the local ID + the ID of the other
> module. This numbering scheme of course only makes sense within a module
> (and perhaps also within a chained PCH). Every (local and global) ID maps
> to an entry in the PCM’s identifier table. When deserializing, the
> in-memory global IdentifierTable is built and each module gets a map that
> remaps its internal global IDs to the “global” global IDs.
>
> For debug info these IDs are not very useful, because they are not
> resilient against even the smallest additive change to the module. We could
> add an integer attribute with the module-internal entity ID to the forward
> declaration that the debugger can use if the module hash and the CU’s dwoid
> are matching, but I’m not yet convinced that it would be worth it. Adding
> it to the definition of the type in the module dwarf seems not worth it
> because then we already have to do a similarly expensive lookup to find the
> module containing the definition.
>

It seems that ideally a module-aware debugger would not consult the DWARF
at all, so I wouldn't advocate having the module format type ID in the
DWARF, but in a side table (either part of the module itself, or just a
section of its own in the module-as-object-file).

Why is it important that the IDs be resilient to changes in the module? I'm
suggesting that non-module debug info should only use type hashes, then
within the module itself there would be a hash-to-ID mapping (a flat table,
probably, wouldn't need to be anything fancy). This would keep the DWARF
less polluted by module concepts - it'd just be up to the consumer "oh,
here's a type hash identifier, resolve that to an AST - either by looking
in DWARF and building an AST from the DWARF type unit with the matching
hash, or by looking in the module and finding the ID/loading the type"?

>
>
>> Everything else would just be DWARF with type units and fission (with the
>> slight wrinkle of type units that aren't resolvable within a single object
>> file - they could reference cross-object/dwo file) - emitting a fission CU
>> for each referenced module.
>>
>> Needing modules to disambiguate/avoid collisions/support non-odr
>> languages wasn't something I understood/had considered back then. That
>> explains the need to add module references to the CU, so the debugger can
>> know which modules to search for the types in (& doesn't just go loading
>> all of them, etc).
>>
>> I would still picture this as "normal type units + a table in the module
>> to resolve types", but if you guys particularly like using the mangled
>> string name (or other identifier) in the DWARF that may avoid the need for
>> an intermediate table (but it doesn't sound like you are avoiding an
>> intermediate table - you said something about having an
>> accelerator-table-like thing to aid in the DWARF->AST mapping? So could
>> that be key'd of the type hash/signature we're using, thus keeping the
>> DWARF more plain/vanilla DWARF5 (type units + fission)?)
>>
>>
>> I originally disliked type signatures and favored using mangled names
>> because the mangled names contained the DeclContext necessary to find types
>> in the PCM. But if we can squeeze the DeclContext somewhere else, that’s
>> fine.
>>
>> From the discussion we had on FlagExternalTypeRef I got the impression
>> that long-form forward declarations are starting to look more attractive:
>> If every external type reference is a reference to a forward declaration
>> that has a complete decl context,
>>
>
> While that's conveniently what we output currently, I'm not sure it's a
> great idea to rely on it. We might one day optimize type references (&
> we'll certainly need to optimize non-type references like member functions,
> etc - since emitting a stub for those would start to, more visibly, reduce
> the benefit of doing this work in the first place, I would imagine) so that
> when there's no contents (which will be more common once we can reference
> members directly with Bag O DWARF) we just have a DW_AT_type encoded as a
> DW_FORM_ref_sig8 directly.
>
>
> Emitting a ref_sig8 directly only works in C++, where the ODR guarantees
> that each signature is globally unique.
>

An identifier will be needed for types without the ODR too - DWARF suggests
hashing the type description (the actual DIE hierarchy) the same as is done
for Fission.

> A consumer would have to search for the type definition in all modules
> (rather than finding it via the decl context), which is less efficient, but
> still works because of the ODR. Thus we’re only depending on the long-form
> forward declarations in non-ODR languages, where we need the decl context
> to disambiguate between non-unique signatures.
>

We shouldn't use the mangled name as the hash for types that don't have an
ODR. The current type units implementation does not create type units for
non-ODR types. (the same way we only do type refs for ODR types, the
current type units implementation piggy backs on exactly the same property
and for the same reasons)

Searching all .dwos for DWARF type units is an existing problem for pure
DWARF type units with Fission - it might be worth considering (talking to
the committee) what would be the right solution for DWARF type
units+Fission and then seeing if that expands to module debug info.

>
>
> with a DW_TAG_module at the root of the decl context chain,
>>
>
> This ^ is something I didn't have in mind and would complicate things
> somewhat. I'd really like to keep things as close to the existing standard
> type unit + split dwarf standards as possible except where necessary to do
> otherwise.
>
>
> In what way does it complicate things?
>

Debuggers probably aren't setup to handle types inside modules in this way
(at least for C++) - I've never seen this construct & imagine debuggers
haven't either & may have some trouble with it (how to name it/make it
nameable by the user, etc - maybe they'd end up putting a module name
prefix on it or something that would be unhelpful?)

I'd really like to stick to known/existing DWARF as much as possible &
consider very carefully anywhere we diverge from that.

>
>
>
>> and a DW_AT_name+DW_AT_signature at the other end, we would have all the
>> information we need without introducing any further LLVM-specific DWARF
>> extensions. To look up an external type from the PCM, the consumer imports
>> the DW_TAG_module and deserializes the type found by declcontext+name. To
>> load the type from DWARF, the consumer grabs the signature from the forward
>> declaration and magically (1) finds the PCM and looks up the type by
>> signature (2).
>>
>> (1) My suggestion is to extend LLVM so it can put the DW_TAG_module with
>> the forward declaration inside the skeleton compile unit (which has the
>> path to the PCM and its DWOid).
>> (2) On ELF with type units this works out of the box,
>>
>
> Not necessarily - the use of DW_TAG_modules in the scope chain might
> confuse/break things. It's pretty unprecedented/non-standard, I would think?
>
>
> It’s use for C++ is unprecedented, but the tag itself very standard.
>

Sure - I get that the tag is standard, but what it means (& especially what
it means to have types inside it) in the context of C++ debugging is
something I'm fairly concerned about.

> on MachO without type units we need some kind of index mapping signature
>> -> DIE (bag of DWARF style?).
>>
>
> I was rather hoping you guys would implement type units (since they'll be
> a step towards Bag O DWARF anyway) on MachO... - at least for this case.
> They wouldn't have to be COMDAt'd or each in their own section, they'd just
> be in a debug_types section one after the other in the module .o file.
>
>
> How does a consumer today find a type unit for a given signature? Does it
> build its own index based on the signatures of the COMDAT sections? DWP
> files define a .debug_tu_index accelerator section to this end, but how is
> this normally handled?
>

.dwo files have no COMDAT sections - you just put all the type units in
them directly.

In any case, that's still going to be a perf hit, because you'll at least
have to read some part of every .dwo file to see which types are in it.

I would believe/assume/imagine this is a performance problem for Type Units
+ Fission as they are defined today (as I've alluded to above) & may be
worth considering general DWARF solutions (talk with the committee sort of
stuff) that may generalize to module debug info too. It's also possible
I've missed something and there are existing solutions to the Type Units +
Fission performance problem.

>
>
> Assuming that many external types will share a similar DeclContext prefix
>> I am not very worried by the space needed to store the long forward
>> references. Compared to storing the mangled name for every type
>>
>
> What I was picturing wasn't to storet the mangled name anywhere - but to
> have, in the module object file somewhere a table from type hash to <useful
> way of accessing a type in a module, which I hope is just a byte offset or
> something similary cheap, small, and fast - not a table lookup, etc, but
> whatever the table lookup has as its values already>.
>
>
> With the decl context and the TAG_module on top a consumer knows which
> module to import (fast) and can do the import by name (hash lookup). With
> the table in the module object and no TAG_module in the decl context, it
> has to scan all modules for the type signature (multiple hash lookups) and
> can directly import the type (fast).
>
> -- adrian
>
>
>
>> it will often actually take up less space. Also, on Darwin at least,
>> llvm-dsymutil can strip them out of the way after resolving the external
>> type references.
>>
>> -- adrian
>>
>>
>> - Dave
>>
>> On Wed, May 6, 2015 at 3:43 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>>
>>> On May 6, 2015, at 3:24 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>
>>>
>>>
>>> On Wed, May 6, 2015 at 3:15 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>>
>>>>
>>>> On May 6, 2015, at 2:52 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, May 6, 2015 at 2:45 PM, Adrian Prantl <aprantl at apple.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On May 6, 2015, at 2:35 PM, Eric Christopher <echristo at gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> That said, add enough to the name for hashing purposes to make it hash
>>>>> uniquely? Or you can go down the path of hashing the type similar to the
>>>>> fission CU hashing (which is what type units were arguably designed to do
>>>>> in the first place if you take a look at the standard, we just only use
>>>>> them for ODR compliant languages etc right now).
>>>>>
>>>>>
>>>>> I suppose one could hash the entire module configuration + the mangled
>>>>> name and get something that is relatively stable.
>>>>> For implementation reasons it would be terrible to do the full fission
>>>>> hashing because that would mean that we would actually have to look up (and
>>>>> deserialize the type) in order to get to its ID when emitting an external
>>>>> type reference, which would void at least some of the performance gains we
>>>>> want from module debugging.
>>>>>
>>>>
>>>> I thought you were proposing using the mangled name of the type for the
>>>> identifier anyway? Perhaps I misunderstood - what are you proposing to use?
>>>> In any case, I'd prefer to see whatever it is hashed and used as the type
>>>> unit signature for compatibility with DWARF5, rather than adding an
>>>> extra/separate/new/non-standard way to do cross-unit/cross-fission type
>>>> references.
>>>>
>>>>
>>>> In the IR I’d /like/ to have a DIExternalTypeRef(DW_TAG_class_type,
>>>> !”_ZTC6TypeName”, !1) with !1 being a reference to either the DIModule or
>>>> the skeleton CU. Then the backend would emit the hash of the name if type
>>>> units are enabled (C++/gdb) or the mangled name (+ the accelerator table
>>>> entry) otherwise (ObjC and/or Darwin). If there is significant pushback to
>>>> the latter, I’d be willing to have the backend emit a hash in both cases
>>>> but we’d have to careful about what to exactly to hash for all the
>>>> aforementioned reasons.
>>>>
>>>
>>> I don't follow - if the mangled name is sufficient, then a hash of the
>>> mangled name should be.... what am I missing?
>>>
>>> Nothing, these are two separate issues:
>>>
>>> If you don't have an ODR to rely on, then a mangled name seems
>>> insufficient just as the hash would be.
>>>
>>>
>>> The decision for mangled name vs hash is motivated by the mangled name
>>> also doubling as a key to look up the type in the AST.
>>> The other problem is (partially) solved by the accelerator table entry
>>> that associates the mangled name with a module. I’m starting to think now
>>> that it might be better to include a fission-style forward declaration +
>>> decl context into the TAG_module instead. The DWARF-style decl context
>>> could in theory be smaller than the mangled name because two types could
>>> share common ancestors and then we could emit the same hash of the mangled
>>> names as we do for type units. But let’s discuss this when it comes up
>>> (together with the patch that makes use of DIExternalTypeRef).
>>>
>>> -- adrian
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20150824/e8f89ea5/attachment-0001.html>