[cfe-dev] RFC: Up front type information generation in clang and llvm

Tue Mar 29 19:43:21 PDT 2016

On Tue, Mar 29, 2016 at 7:31 PM Peter Collingbourne <peter at pcc.me.uk> wrote:

> Thanks for sharing this. Mostly seems like a reasonable plan to me. A few
> comments below.
>
>
Thanks Peter!

> On Tue, Mar 29, 2016 at 6:00 PM, Eric Christopher via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hi All,
>>
>> This is something that's been talked about for some time and it's
>> probably time to propose it.
>>
>> The "We" in this document is everyone on the cc line plus me.
>>
>> Please go ahead and take a look.
>>
>> Thanks!
>>
>> -eric
>>
>>
>> Objective (and TL;DR)
>> =================
>>
>> Migrate debug type information generation from the backends to the front
>> end.
>>
>> This will enable:
>> 1. Separation of concerns and maintainability: LLVM shouldn’t have to
>> know about C preprocessor macros, Obj-C properties, or extensive details
>> about debug information binary formats.
>> 2. Performance: Skipping a serialization should speed up normal
>> compilations.
>> 3. Memory usage: The DI metadata structures are smaller than they were,
>> but are still fairly large and pointer heavy.
>>
>> Motivation
>> ========
>>
>> Currently, types in LLVM debug info are described by the DIType class
>> hierarchy. This hierarchy evolved organically from a more flexible
>> sea-of-nodes representation into what it is today - a large, only somewhat
>> format neutral representation of debug types. Making this more format
>> neutral will only increase the memory use - and for no reason as type
>> information is static (or nearly so). Debug formats already have a
>> memory-efficient serialization (their own binary format), so we should support a
>> front end emitting type information with sufficient representation to allow
>> the backend to emit debug information based on the more normal IR features:
>> functions, scopes, variables, etc.
>>
>> Scope/Impact
>> ===========
>>
>> This is going to involve large scale changes across both LLVM and clang.
>> This will also affect any out-of-tree front ends; however, we expect the
>> impact to be on the order of a large API change rather than needing massive
>> infrastructure changes.
>>
>> Related work
>> ==========
>>
>> This is related to the efforts to support CodeView in LLVM and clang as
>> well as efforts to reduce overall memory consumption when compiling with
>> debug information enabled; in particular, efforts to prune LTO memory usage.
>>
>>
>> Concerns
>> ========
>>
>>
>> We need a good story for transitioning all the debug info testcases in
>> the backend without giving up coverage and/or readability. David believes
>> he has a plan here.
>>
>> Proposal
>> =======
>>
>> Short version
>> -----------------
>>
>> 1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line
>> Table.
>> 2. Split the clang CGDebugInfo API into Types and Line Table to match.
>> 3. Add an LLVM DWARF emission library similar to the existing CodeView one.
>> 4. Migrate the Types API into a clang internal API taking clang AST
>> structures and use the LLVM binary emission libraries to produce type
>> information.
>> 5. Remove the old binary emission out of LLVM.
>>
>>
>> Questions/Thoughts/Elaboration
>> -------------------------------------------
>>
>> Splitting the DIBuilder API
>> ~~~~~~~~~~~~~~~~~~~~
>> Will DISubprogram be part of both?
>>    * We should split it in two: Full declarations with type and a slimmed
>> down version with an abstract origin.
>>
>> How will we reference types in the DWARF blob?
>>    * ODR types can be referenced by name
>>    * Non-ODR types by full DWARF hash
>>    * Each type can be a pair (tuple) of identifier (DITypeRef today) and
>> blob.
>>    * For DWARF versions before 4 we can emit each type as a unit, but not
>> as a DWARF Type Unit, and use references and module relocations for the
>> offsets. (See below)
>>
>> How will we handle references in DWARF2 or global relocations for
>> non-type template parameters?
>>    * We can use a “relocation” metadata as part of the format.
>>    * Representable as a tuple holding the DIType and the offset within
>> the DIBlob at which to write the final relocation/offset for the reference
>> at emission time.
>>
>> Why break up the types at all?
>>    * To enable non-debug format aware linking and type uniquing for LTO
>> that won’t be huge in size. We break up the types so we don’t need to parse
>> debug information to link two modules together efficiently.
>>
>
> How do you plan to handle abbreviations? You wouldn't necessarily be able
> to embed them directly in the blob, as when doing LTO each compilation unit
> would have its own set of abbreviations. I suppose you could do something
> like treat them as a special sort of reference to an abbreviation table
> entry, or maybe pre-allocate in the frontend (though that would complicate
> cross-frontend LTO), but I'm curious what you have in mind.
>

Thanks for reminding me, I knew I was forgetting something I'd talked about
when writing all of this down. :)

Basically, you can handle abbreviations similarly to types: create a blob
with an index/hash/etc. and then reference that as part of the type tuple,
e.g.:

$1 = { DIAbbrev: 0x1234, DIBlob: <blah> }
$2 = { DIType: <ID>, DIAbbrev: $1, DIBlob: <blah> }

and keep them uniqued during emission and remember to merge these as well
during module merge time.
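
As a rough illustration of the merge step (a sketch only, not part of the
proposal): the types below are hypothetical stand-ins rather than LLVM
classes, abbreviations are assumed to be uniqued by their encoded bytes, and
types are assumed to be uniqued by identifier alone.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for the tuples sketched above; not LLVM types.
struct AbbrevBlob {
  std::string Bytes; // encoded abbreviation declaration
};

struct TypeEntry {
  std::string TypeId;   // e.g. an ODR name or a hash
  uint32_t AbbrevIndex; // index into the owning module's abbreviation table
  std::string Blob;     // encoded type information
};

struct Module {
  std::vector<AbbrevBlob> Abbrevs;
  std::vector<TypeEntry> Types;
};

// Merge Src into Dst. Abbreviations are uniqued by content, types by TypeId,
// and every copied type is re-pointed at its abbreviation's index in Dst.
void mergeModule(Module &Dst, const Module &Src) {
  std::map<std::string, uint32_t> AbbrevIndexByBytes;
  for (uint32_t I = 0; I < Dst.Abbrevs.size(); ++I)
    AbbrevIndexByBytes.emplace(Dst.Abbrevs[I].Bytes, I);

  std::set<std::string> SeenTypes;
  for (const TypeEntry &T : Dst.Types)
    SeenTypes.insert(T.TypeId);

  for (const TypeEntry &T : Src.Types) {
    if (!SeenTypes.insert(T.TypeId).second)
      continue; // Dst already has a definition with this identifier

    assert(T.AbbrevIndex < Src.Abbrevs.size() && "dangling abbrev reference");
    const std::string Bytes = Src.Abbrevs[T.AbbrevIndex].Bytes;

    uint32_t NewIndex;
    auto It = AbbrevIndexByBytes.find(Bytes);
    if (It != AbbrevIndexByBytes.end()) {
      NewIndex = It->second; // reuse the identical abbreviation
    } else {
      NewIndex = static_cast<uint32_t>(Dst.Abbrevs.size());
      Dst.Abbrevs.push_back(AbbrevBlob{Bytes});
      AbbrevIndexByBytes.emplace(Bytes, NewIndex);
    }
    Dst.Types.push_back(TypeEntry{T.TypeId, NewIndex, T.Blob});
  }
}
```

Calling mergeModule(A, B) leaves A with one copy of each distinct
abbreviation, and each type taken from B refers to the merged table rather
than to B's original indices.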

>
>> Any other concerns there?
>>    * Debug information without type units might be slightly larger in
>> this scheme due to parents being duplicated (declarations and abstract
>> origin, not full parents). It may be possible to extend dsymutil/etc to
>> merge all siblings into a common parent. Open question for better ways to
>> solve this.
>>
>
> When we were thinking about teaching the backend to produce blobs from IR
> metadata we were thinking about cases where the debug info emitter would
> discover special member functions during IR traversal. I guess since we're
> moving all of that to the frontend we can just ask the frontend directly
> which special members are needed on the class. That solves the problem for
> a single translation unit. But what do you plan to do in the multiple
> translation unit case where two TUs declare different special members on a
> class? Would it be fine to just emit the two definitions and let the
> debugger sort it out? I guess this is the type of thing that debuggers
> normally deal with in the non-LTO case, so I suppose so?
>

Pretty much. This is one area where I have... disagreements with the DWARF
committee and I don't think there's anything else we can do here. TBH right
now I think we'd have issues with type units and special member functions
since we're using ODR-ness to unique.

-eric

>
>
>> How should we handle DWARF5/Apple Accelerator Tables?
>>    * Thoughts:
>>    * We can parse the DWARF in the back end and generate them.
>>    * We can emit in the front end for the base case of non-LTO (with help
>> from the backend for relocation aspects).
>>    * We can use dsymutil on LTO debug information to generate them.
>>
>> Why isn’t this a more detailed spec?
>>    * Mostly because we’ve thought about the issues, but we can’t plan for
>> everything during implementation.
>>
>>
>> Future work
>> ----------------
>>
>> Not contained as part of this, but an obvious future direction is that
>> the Module linker could grow support for debug aware linking. Then we can
>> have all of the type information for a single translation unit in a single
>> blob and use the debug aware linking to handle merging types.
>>
>
>
> --
> Peter
>