[PATCH] Have clang list the imported modules in the debug info

Tue May 5 20:07:45 PDT 2015

On Tue, May 5, 2015 at 5:00 PM, Adrian Prantl <aprantl at apple.com> wrote:

>
> On May 5, 2015, at 3:14 PM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
>
> On Tue, May 5, 2015 at 3:01 PM, Adrian Prantl <aprantl at apple.com> wrote:
>
>>
>> > On May 5, 2015, at 9:59 AM, David Blaikie <dblaikie at gmail.com> wrote:
>> >
>> >
>> >
>> > On Mon, May 4, 2015 at 4:46 PM, Adrian Prantl <aprantl at apple.com>
>> wrote:
>> >
>> >>> On May 4, 2015, at 1:31 PM, David Blaikie <dblaikie at gmail.com> wrote:
>> >>>
>> >> ...
>> >>>> >>> So you're going to need to implement fission (to at least some
>> degree) support in LLDB, then? (to support the case where you haven't
>> linked debug info with llvm-dsymutil, but you've hit one of these lookup
>> problems where you need to cross possibly-conflicting modules)
>> >>>> >>
>> >>>> >> Yes. Specifically, it won’t support type units, and it will look
>> up types by name rather than by signature. (cf. the second part of
>> http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150427/128278.html
>> )
>> >>>> >
>> >>>> > How are you going to reference the types in the module's fission
>> CU without type units/signatures? Are you going to emit type declarations
>> into the normal CU and rely on the debugger to know that these declarations
>> can be resolved by looking elsewhere? (just without the benefit of
>> constraining that search to just looking for a matching TU?)
>> >>>>
>> >>>> If you look at the example in
>> http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150427/128278.html,
>> there will be an external type index (using the usual accelerator table
>> format) that maps an external type’s UID to a pcm. In the pcm there is an
>> extra accelerator table entry that maps UID to DIE offset.
>> >>>
>> >>> I mean I guess that's up to you, but seems like a relatively large
>> workaround compared to supporting type units... (I mean certainly seems
>> like strictly less work to do the workaround than implementing type units
>> in LLDB, but a relatively large amount of work to do/throw away eventually
>> once LLDB supports type units)
>> >>
>> >> It’s not primarily meant to be a workaround so LLDB doesn’t need to
>> implement type units; the UIDs double as the key (decl context + name) to
>> import types directly from the AST. The other advantage is that we won’t
>> have to worry about MD5 hash collisions, but that’s more of a theoretical
>> advantage.
>> >
>> > If there's a concern about collisions, we should fix that regardless
>> (because we'll have the same problem with normal type units or modularized
>> type units).
>>
>> My understanding is that split DWARF is disambiguating hash collisions by
>> adding the full decl context and name to both the definition and the
>> forward declaration:
>>
>
> No, that's not a requirement. It's just an aspect of our implementation so
> we can put entities in the type when needed (such as for putting member
> function declarations in there to refererence from out of line member
> function definitions)
>
> In  the simplest case, you can reference a type unit directly:
>
> DW_TAG_variable
>   DW_AT_type [DW_FORM_ref_sig8] ....
>
>
> You’re right. The standard allows this, although all the examples in the
> appendix show the long form that LLVM also emits.
>
>
> Without any DW_TAG_structure_type/namespaces/etc. Once we have Bag O'
> DWARF, we'd move to that representation almost entirely. (the signature
> we're using for this is the mangled name of the type so it includes the
> decl context in the hash anyway)
>
>
> .. which, incidentally, will look a lot like this UID example, right?
>
> (for completeness, the UID type reference would look like this:
>>  DW_TAG_variable
>>    DW_AT_name “myC”
>>    DW_AT_type [DW_FORM_strp] “_ZTN1AS1BS1C” )
>>
>>
> Oh I see, you mean hashing the mangled name and using it as a signature
> rather than emitting the name directly.
>

Yes, that's what LLVM does for type units today.

> But then, to disambiguate, it would still be beneficial to have the
> mangled name available.
>

If we believe hash collisions to be an issue - then type units are broken
and I'd like to figure out how to fix them, then use the same feature here.

>
>
>>
>> So I think that hash collisions are actually handled for normal fission
>> and we don’t need to worry about them. The bag o’dwarf will still contain
>> the full decl context in the DWO, so it will be fine there, too.
>>
>> >
>> > If we've already got (& have to use in the case of DWARF fallback for
>> modules) to support a hash or other numeric identifier, it might be good to
>> use that for everything rather than having two mechanisms?
>>
>> Generally, one way to do it is of course better than maintaining two :-)
>> But using the UIDs in the .o file allows us to elide the the forward
>> declarations from the .o file
>
>
> It doesn't, though - because we need the forward declarations to put
> members into (such as member function definitions - this'll come up even
> for modules quite commonly with any inline member function, for example -
> its definition will appear in the .o file and the definition
> DW_TAG_subprogram will need to DW_AT_specification pointing to the
> declaration of the member function - which is a DW_TAG_subprogram inside
> the DW_TAG_class that is a declaration with a signature that refers to the
> type unit).
>
>
> Couldn’t we make the DW_AT_specification an external type reference
> pointing to the full type in the module?
>

No, because it needs to point to the specific function declaration, not
just the type it's a member of.

>
>
> Only once we've got Bag O' DWARF do we remove the need for these
> declarations.
>
> Granted we can get rid of them in /some/ cases today - when there's no
> need to reference members in the type (or nest other members - like a
> nested class, for example - though we could possibly workaround even that
> with Bag O DWARF by having the type in the type unit have a declaration of
> the nested type - then reference that from the nested type definition in
> another type unit or .o file, etc).
>
> I don't think there's anything you can leverage here that's different from
> what type units need. If you're going to do anything different from the way
> type units work today, I'd really like to discuss that further and
> understand what's different here that benefits from/allows for that
> difference that doesn't apply to type units themselves.
>
>
>> (granted the UIDs carry the same information, so it’s not actually as
>> small as it looks like) so all that dsymutil needs to do is replace the
>> string references with FORM_ref_addrs, no further stripping needed.
>>
>
> Good point. I don’t have any other secret plans up my sleeve apart from
> what I already outlined in this thread :-)
> I see the UID type references for LLDB like a bag o’dwarf version 0.5.
>

But type units are already Bag O Dwarf 0.5 (0.2, perhaps - I take it you
plan to put all your types in one blob, rather than multiple), that was
what I was suggesting when we started this. The difference between a type
unit and a Bag O Dwarf is the ability to have multiple hash->DIE offsets in
the unit header.

The point of using type units was to use an existing standard construct -
then extend it to support more use cases (and improve the non-modules
situation at the same time - as you & Greg noticed, type units are rather
verbose in part due to their duplication, and the duplication they force on
the CU that references them when needing to refer to specific parts of the
type)

> Any lessons learned there (like the hash collisions) should definitely be
> addressed in the final bag o’dwarf design.
>
>
>>
>> >
>> >>
>> >>>
>> >>>
>> >>> >>
>> >>> >>
>> >>> >>>
>> >>> >>> OK, so I think it's probably reasonable for now to just add
>> DW_TAG_modules to the CU for each referenced module (or does it have to be
>> each referenced submodule? (can two submodules within a single module be
>> contradictory/conflicting?)). Since we don't have any good way to reference
>> the module is a foreign unit while deduplicating that unit... there's not
>> much point having the imported_module - but if you think it adds anything,
>> I'm open to ideas.
>> >>> >> It could help keeping things simpler.
>> >>> >> Emitting it doesn’t add much semantic value because module imports
>> always occur at the top level, but it will make the transition to the
>> deduplicated TAG_modules easier — It could be easier to teach consumers
>> once about imported_module({ref to TAG_module}) rather than having them
>> also recognize top-level TAG_modules as an intermediate step. It’s also
>> slightly easier to implement in LLVM because the imported_module allows us
>> to anchor the TAG_module in the CU, but that’s not a very strong argument.
>> >>> >
>> >>> > Agreed on all counts (not a strong argument, but convenient enough,
>> etc, etc).
>> >>> >
>> >>> > I'm still not entirely sure what the right answer is here, though,
>> which is why I'm hesitant to bake anything in too strongly.
>> >>> >
>> >>> > To come back to one of the outstanding questions: Do you need
>> submodule import information, or just module level (if modules cannot have
>> internal conflicts and you can't avoid cross-module conflicts just by lack
>> of visibility (I have no idea if either of those things are true) then you
>> may just need per-module not per-submodule info)?
>> >>>
>> >>> At the moment I do not think that it makes sense for two submodules
>> to conflict, but there is nothing in the clang documentation that
>> explicitly forbids this. With this in mind, I think it is reasonable to not
>> support submodules (at least initially) and always emit an import for the
>> parent module.
>> >>> Thats what I wanted to write ... but I as I’m browsing through our
>> documentation,
>> http://clang.llvm.org/docs/Modules.html#conflict-declarations explicitly
>> gives an example of two conflicting submodules, so maybe this is not a
>> reasonable simplification after all. On the other hand, a quick grep over
>> all system module maps on OS X doesn’t show a single conflict declaration.
>> >>>
>> >>> I still believe we do not need to support submodules right from the
>> start, but we should have a story for getting there if we need to.
>> >>>
>> >>> Given the simple example that demonstrates the possibility, it seems
>> fair to have a story for what that looks like, yes - even if a first
>> pass/prototype doesn't support it.
>> >>
>> >> Sure.
>> >>>
>> >>>
>> >>> >
>> >>> > Also, does each submodule need different special attributes/flags?
>> If the special codegen attributes you want are at the module level, it'd
>> probably be best to keep those on the Skeleton CU for the module (that will
>> be comdat folded, etc, on ELF - and they could be DWARF-aware deduplicated
>> by llvm-dsymutil) so they're not duplicated. The DW_TAG_module would then
>> just have a DW_AT_signature attribute or something similarly small/trivial
>> to point to the skeleton CU.
>> >>>
>> >>> The attributes are derived from cc1 command line arguments. Not two
>> submodules imported by one CU can have different attributes. All submodules
>> in a pcm also share their attributes. Putting them into the skeleton CU
>> appears to be the most efficient place to put them, though perhaps not the
>> most logical one.
>> >>>
>> >>> Why not the most logical? It'd be nice if it were a DW_TAG_module
>> instead of a DW_TAG_compile_unit - but given the limited vocabulary we have
>> in DWARF top level tags, it seems as good as we can have.
>> >>
>> >> I tend to view the module configuration (include path, isysroot,
>> configuration macros) to be a part of the module and not a part of the
>> skeleton that points to the split debug info for that module.
>> >
>> > Sure, it's a workaround for the lack of Bag O' DWARF for now, one way
>> or another. Either way the debugger's going to have to jump the
>>
>> [there’s something missing towards the end]
>> No it’s not a workaround. Having the full module configuration is what
>> allows LLDB to rebuild the module if the module cache has been cleared.
>> >
>> >> A module is uniquely identified by name + configuration. That’s why I
>> feel it should be part of the tag that also holds the name.
>> >>
>> >>>
>> >>> I would prefer to stick the attributes on the (top-level)
>> DW_TAG_module and later deduplicate the attributes together with the
>> DW_TAG_module. Sticking them on the skeleton won’t save any space in the .o
>> files and would save 3*4-8=4 bytes (3x FORM_strp for include, macro, and
>> isysroot - 1x FORM_ref_sig_8) per CU and imported module.
>> >>>
>> >>> Seems nicer not to duplicate them, especially since not everyone will
>> be using a debug-aware linker like llvm-dsymutil (LLDB on Windows or Linux
>> won't have that convenience). Eventually we can use Bag O' DWARF for the
>> skeleton CU, make it a DW_TAG_module (with more DWARF changes to allow that
>> as a top-level tag, if desired/useful - I'm not sure it adds a lot) and
>> have the imported_module reference it that way. (DW_TAG_imported_module,
>> DW_AT_import, DW_FORM_ref_sig8)
>> >>>
>> >>> I'm not /hugely/ invested in this, but we do have people caring about
>> LLDB on Linux and Windows, so avoiding tying the LLDB story to MachO and
>> dsymutil, etc, seems valuable.
>> >>
>> >> I think that this would be an unnecessary intermediate step that we
>> eventually want to migrate away from anyway. We already identified that the
>> good solution for deduplication is going to be a skeleton TAG_module, so my
>> view is that it is not worth the trouble adding a temporary indirection
>> (and a new attribute name)
>> >
>> > New attribute name?
>>
>> What attribute would we put into the TAG_module to refer to the skeleton
>> CU?
>>
>
> I imagine we'd just match by name in that instance.
>
>
> That doesn’t generally work. I could have a.c import Foo and b.c import
> Foo -DNDEBUG.
>

Is this a valid use of modules? It's not immediately obvious to me how we
reconcile that. Wouldn't the modules for those two things need to have
different file names (so they could exist simultaneously)? & if so, then
name matching should suffice..

> If the debugger loads the debug info from the linked .dSYM bundle it
> wouldn’t be able to tell which version of Foo to import. Actually, that’s a
> really good argument why the module configuration needs to sit inside the
> TAG_module.
>
>
> (& eventually use a sig8, etc)
>
>
> Unless we refer to it via ref_sig8 (and disregard collisions, which I
> think might be statistically fine for module configurations),
>

Right, can't do that - because ref_sig8 is only spec'd for type units, not
other things.

All I'm really driving at here is that module debug info can be phrased in
a way that works for DWARF5 debuggers like GDB today: type units and
fission. I'd like it to be implemented that way (certainly because it's
convenient for me - since we have GDB doing this already, but also it seems
like a nice/tidy principled idea based on existing standard features) and
then used as a platform to propose new/extended DWARF features that could
make it better. It seems things got muddied somewhere along the way from
our original discussions on this.

> which we already agreed to leave for the next revision.
>
> -- adrian
>
>
>
>>
>> >
>> >> to save 4 bytes in the intermediate step.
>> >
>> > The debugger's going to need to resolve the skeleton anyway (in the
>> case of non-AST based debugging) so I'm not sure how much it's an extra
>> step.
>>
>> My “intermediate step” was referring to the non-dedup’ed TAG_module that
>> is not (yet) part of the dedup’ed skeleton CU (but will be in the next
>> iteration). I just didn’t think that it is worth to optimize a
>> representation that will be changed soon anyway.
>>
>
> It doesn't seem to cost much, but I'll give up on this.
>
>
>
>>
>> >
>> >> I don’t actually think there is anything about the TAG_module design
>> tying this to either MachO or dsymutil, but let me know if you feel
>> otherwise.
>> >
>> > Sorry, what I was getting at was that with the Mach0/dsymutil/lldb
>> story you probably don't need to consult the skeleton debug info (actually
>> I forgot, you do - in the case where you need a name from a module that
>> might be incompatible (it's not referenced directly in this CU)) -
>> pre-dsymutil, you'd use the ASTs directly, post-dsymutil I expect you'll
>> have inlined all the debug info so there are no skeletons, etc, if I'm
>> understanding your design correctly.
>>
>> Right, we need all of it :-)
>>
>> -- adrian
>>
>> >
>> > - David
>> >
>> >>
>> >> -- adrian
>> >>>
>> >>> >
>> >>> > If you need submodule import lists, then each DW_AT_module
>> representing a submodule would have a name (anything else?) and the
>> signature refering to its module skeleton CU.
>> >>>
>> >>> What I’m envisioning is
>> >>>
>> >>> .debug_info:
>> >>>   DW_TAG_compile_unit
>> >>>     ...
>> >>>     DW_TAG_imported_module
>> >>>      // import FooSubA
>> >>>      DW_AT_import [DW_FORM_ref4] (0x60)
>> >>>
>> >>>     DW_TAG_module
>> >>>       DW_AT_name(“FooLib”)
>> >>>       DW_AT_LLVM_sysroot(“/“)
>> >>>       DW_AT_LLVM_include_dirs(“-I/path”)
>> >>>       DW_AT_LLVM_macros(“-DNDEBUG”)
>> >>> 0x60:
>> >>>       DW_TAG_module
>> >>>         DW_AT_name(“FooSubA”)
>> >>>         // need not be emitted if not referenced.
>> >>>         DW_TAG_module
>> >>>           DW_AT_name(“FooSubASubA”)
>> >>>
>> >>>       // need not be emitted if not referenced.
>> >>>       DW_TAG_module
>> >>>         DW_AT_name(“FooSubB”)
>> >>>
>> >>>
>> >>>
>> >>> -- adrian
>> >>
>> >>
>> >
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20150505/01296441/attachment.html>