[cfe-dev] RFC: CodeView debug info emission in Clang/LLVM

Thu Oct 29 12:42:22 PDT 2015

I am really excited to see the work for generating CodeView done.

I have two questions:
1. Will the CodeView information be publicly documented?
2. Will LLD and LLDB be updated as necessary to support CodeView?

On Thu, Oct 29, 2015 at 10:11 AM, Dave Bartolomeo via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> *RFC: CodeView debug info emission in Clang/LLVM*
>
>
>
> *Overview*
>
> On Windows, the de facto debug information format is CodeView, most
> commonly encountered in the form of a .pdb file. This is the format emitted
> by the Visual C++, C#, and VB.NET compilers, consumed by the Visual
> Studio debugger and the Windows debugger (WinDbg), and exposed for
> read-only access via the DIA SDK. The CodeView format has never been
> publicly documented, and Microsoft has never provided an API for emitting
> CodeView info for native code. Therefore, Clang and LLVM have only been
> able to emit the small subset of CodeView information that the community
> has been able to reverse engineer.
>
>
>
> In order to improve the experience of using Clang and other LLVM-based
> compilers to target Windows, Microsoft has decided to contribute code to
> the LLVM project to read and write CodeView debug information, including
> changes to make Clang and LLVM emit CodeView debug information for C and
> C++ code. This RFC covers the first phase of this work: Emitting CodeView
> type information for C and C++. The next phase will be to emit CodeView
> symbol information for functions and their local variables; I’ll send out a
> separate RFC for that when I get to that phase.
>
>
>
> I’ll start with some background on the CodeView format, and then move on
> to the proposed design.
>
>
>
> *Overview of the CodeView Debug Information Format*
>
> “CodeView” is the name we use to refer to the debug record format
> generated by the Visual C++ compiler and consumed by the Visual Studio
> debugger, the Windows debugger (WinDbg), and the DIA SDK. CodeView records
> are contained in either a .pdb file or in an object file. The CodeView
> records that describe the debug information for a PE image (i.e. a .dll or
> .exe) are always contained in a corresponding PDB file. The CodeView
> records that describe the debug information for a COFF object file (.obj)
> are contained within the .obj itself, although some of the debug
> information will be stored in a .pdb file if the .obj was compiled with the
> /Zi or /ZI option.
>
>
>
> When code is compiled with cl.exe using the /Z7, /Zi, or /ZI option,
> cl.exe generates two well-known sections in the resulting .obj file:
> “.debug$T” and “.debug$S”. These are known as the “types” section and the
> “symbols” section, respectively. The types section contains CodeView
> records that describe all of the data types referenced by symbols in that
> .obj. The symbols section contains CodeView records that describe all of
> the symbols defined within the .obj, including functions, global and static
> data, and local variables. When link.exe is invoked with the /debug option,
> all of the debug information from the contributing .obj files is combined
> into a single .pdb file for the linked image.
>
>
>
> *The .debug$T Section*
>
> The types section of the .obj file contains a short header consisting
> solely of the version number of the CodeView types format (currently equal
> to 4), followed by a sequence of CodeView type records. Each type record
> starts with a 16-bit field holding the length of the record, followed by a
> 16-bit tag field that identifies the kind of type described by the record.
> The format of the remainder of the record depends on the tag. Common type
> record kinds include:
>
> - Pointer
> - Array
> - Function
> - Struct
> - Class
> - Union
> - Enum
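
To check my reading of the record layout described above, here is a minimal Python sketch that builds and walks a types section. Several things are my assumptions rather than anything stated in the RFC: that the version header is a 32-bit little-endian integer, that the 16-bit length counts the bytes following the length field (tag plus payload), and the tag values, which are made up for the example.

```python
import struct

def build_record(tag: int, payload: bytes) -> bytes:
    """Serialize one record: 16-bit length, 16-bit tag, then the payload.

    Assumption: the length counts the bytes after the length field
    (tag plus payload), not the length field itself.
    """
    return struct.pack("<HH", 2 + len(payload), tag) + payload

def parse_types_section(data: bytes):
    """Return (version, [(tag, payload), ...]) for a types section blob.

    Assumption: the header is a single 32-bit little-endian version number.
    """
    (version,) = struct.unpack_from("<I", data, 0)
    offset = 4
    records = []
    while offset < len(data):
        length, tag = struct.unpack_from("<HH", data, offset)
        if length < 2 or offset + 2 + length > len(data):
            raise ValueError(f"malformed record at offset {offset}")
        payload_end = offset + 2 + length
        records.append((tag, data[offset + 4:payload_end]))
        offset = payload_end
    return version, records

# Hypothetical tag values, chosen only for this example.
TAG_POINTER = 0x0001
TAG_ARRAY = 0x0002

section = (
    struct.pack("<I", 4)
    + build_record(TAG_POINTER, b"\x10\x00\x00\x00")
    + build_record(TAG_ARRAY, b"")
)
version, records = parse_types_section(section)
```

The parser never looks inside a payload, which matches the statement that everything after the tag depends on the record kind.
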
>
>
>
> Duplicate type records are folded based on a binary comparison of their
> contents. Thus, there will be only a single instance of the type record for
> ‘const char*’ in a given types section, regardless of the number of uses of
> that type.
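
The folding rule above can be sketched as a small append-only table keyed on the raw record bytes. This is only an illustration: the class name and the stand-in byte strings are mine, and the starting index of 0x1000 is taken from the type index paragraph that follows in the RFC.

```python
FIRST_TYPE_INDEX = 0x1000  # lower type indices have no record of their own

class TypeTable:
    """Append-only table that folds byte-identical records."""

    def __init__(self):
        self.records = []     # record bytes, in emission order
        self._index_of = {}   # record bytes -> type index

    def add(self, record: bytes) -> int:
        """Return the type index for `record`, adding it only if it is new."""
        ti = self._index_of.get(record)
        if ti is None:
            ti = FIRST_TYPE_INDEX + len(self.records)
            self.records.append(record)
            self._index_of[record] = ti
        return ti

table = TypeTable()
a = table.add(b"pointer-to-const-char")  # stand-in bytes, not a real record
b = table.add(b"array-of-int")
c = table.add(b"pointer-to-const-char")  # byte-identical, so it folds into `a`
```

Because the comparison is purely binary, two records that describe the same type with different bytes would not fold in this sketch.
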
>
> When one type record needs to refer to another type record (e.g. a Pointer
> record referring to the record that describes the referent type of the
> pointer), it uses a 32-bit “type index”, usually abbreviated “TI”. A TI
> with a value less than 0x1000 refers to a well-known type for which no type
> record actually exists. Examples include primitive types like ‘int’ or
> ‘wchar_t’, and simple pointers to these primitive types. A TI with a value
> of 0x1000 or greater refers to another type record in the types
> section, whose zero-based index is determined by subtracting 0x1000 from
> the value of the TI. It is an invariant of the types section that a given
> type record may only use a TI to refer to type records defined earlier in
> the types section. Thus, no cycles are possible. In order to support types
> with cyclic dependencies, user-defined types (class, struct, union, enum)
> can have two records for each type: one to describe the forward
> declaration, and one to describe the definition. Other records refer to the
> forward declaration of the type, and only the definition record contains
> the member list of the type. The debugger matches a forward declaration
> with its definition based on the qualified name of the type.
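
A small Python sketch of the type index rules above, as I understand them. Records are modelled as (kind, name, referenced TIs) tuples, which is my own simplification and not the real encoding; the kind strings and the T_INT value are hypothetical.

```python
FIRST_TYPE_INDEX = 0x1000

def record_position(ti: int):
    """Map a TI to its zero-based record position, or None for a well-known type."""
    if ti < FIRST_TYPE_INDEX:
        return None
    return ti - FIRST_TYPE_INDEX

def check_backward_refs(records) -> bool:
    """True if every record refers only to well-known types or earlier records."""
    for position, (_kind, _name, refs) in enumerate(records):
        own_ti = FIRST_TYPE_INDEX + position
        if any(ti >= own_ti for ti in refs):
            return False
    return True

def find_definition(records, forward_ti: int):
    """Match a forward declaration to its definition by qualified name."""
    _kind, name, _refs = records[record_position(forward_ti)]
    for position, (kind, other_name, _r) in enumerate(records):
        if kind == "struct_def" and other_name == name:
            return FIRST_TYPE_INDEX + position
    return None

T_INT = 0x0010  # hypothetical well-known type index below 0x1000

# struct Node { int value; Node *next; };
records = [
    ("struct_fwd", "Node", ()),               # 0x1000: forward declaration
    ("pointer", "", (0x1000,)),               # 0x1001: Node*
    ("struct_def", "Node", (T_INT, 0x1001)),  # 0x1002: definition with members
]

# The same type without a forward declaration needs a forward reference.
cyclic = [
    ("struct_def", "Node", (T_INT, 0x1001)),  # 0x1000 refers ahead to 0x1001
    ("pointer", "", (0x1000,)),               # 0x1001
]
```

Here `check_backward_refs(records)` holds while `check_backward_refs(cyclic)` does not, which is exactly why the forward declaration record is needed for self-referential types.
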
>
>
>
> Type indices are also used within the .debug$S section to refer to types
> in the .debug$T section.
>
>
>
> If a given .obj file was compiled with the /Zi or /ZI option, the type
> records for that .obj are stored in a separate .pdb file, rather than in
> the .obj file itself. The records in the PDB have exactly the same format
> as those in the .obj, so there is essentially no functional difference in
> the debug info itself.
>
>
>
> When the linker generates the .pdb for an image, it creates a single types
> section in the .pdb consisting of the transitive closure of all of the type
> records referenced by any symbol in any of the contributing .objs, with any
> type indices suitably fixed up to refer to the correct record in the merged
> types section.
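
To make the merge step concrete, here is a rough Python sketch of how I picture it: copy only the records reachable from the symbols' type indices, fold duplicates, and remap indices as you go. This is not a description of what link.exe actually does. Records are again modelled as (kind, name, refs) tuples, and every name and value in the example is hypothetical.

```python
FIRST_TYPE_INDEX = 0x1000

def merge_types(merged, index_of, obj_records, roots):
    """Copy the records reachable from `roots` into `merged`, remapping TIs.

    merged      -- list of records already in the output (mutated in place)
    index_of    -- dict mapping a merged record to its TI (mutated in place)
    obj_records -- one object file's records, as (kind, name, refs) tuples
    roots       -- TIs in that object file that its symbols refer to
    Returns a dict mapping the object file's TIs to TIs in the merged table.
    """
    remap = {}

    def visit(ti):
        if ti < FIRST_TYPE_INDEX:
            return ti  # well-known type, same index everywhere
        if ti in remap:
            return remap[ti]
        kind, name, refs = obj_records[ti - FIRST_TYPE_INDEX]
        # References only point backward, so this recursion terminates.
        new_record = (kind, name, tuple(visit(r) for r in refs))
        new_ti = index_of.get(new_record)
        if new_ti is None:
            new_ti = FIRST_TYPE_INDEX + len(merged)
            merged.append(new_record)
            index_of[new_record] = new_ti
        remap[ti] = new_ti
        return new_ti

    for root in roots:
        visit(root)
    return remap

T_INT = 0x0010  # hypothetical well-known type index

obj_a = [("pointer", "", (T_INT,)), ("array", "", (0x1000,))]
obj_b = [("enum", "Unused", ()), ("pointer", "", (T_INT,)), ("array", "", (0x1001,))]

merged, index_of = [], {}
remap_a = merge_types(merged, index_of, obj_a, roots=[0x1001])
remap_b = merge_types(merged, index_of, obj_b, roots=[0x1002])
```

Both object files end up sharing the same two merged records, and the unreferenced "Unused" enum in the second one is never copied. Referenced records are appended before the records that use them, so the merged table keeps the backward-reference invariant.
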
>
>
>
> *The .debug$S Section*
>
> The symbols section of the .obj file contains several substreams to
> describe the symbols defined in that .obj. The most common substreams are:
>
> - Line Numbers: Contains mappings from code address ranges to source
>   file, line, and column.
> - Source File Info: Contains the file names and file hashes of source
>   files referenced in the Line Numbers stream.
> - Symbols: Contains symbol records that describe functions and variables.
>
>
>
> The Symbols substream is a sequence of records that, like the type
> records, each begin with a 16-bit size and a 16-bit tag. Common symbol
> record kinds include:
>
> - Global Data
> - Function
> - Block Scope
> - Stack Frame
> - Frame Pointer-Relative Variable
> - Register-Relative Variable
> - Enregistered Variable
>
>
>
> Unlike type records, some symbol records can be nested. For example,
> Function records usually contain a Stack Frame record, local variable
> records, and Block Scope records. Block Scope records can in turn contain
> more local variable and Block Scope records.
>
>
>
> When a symbol record needs to refer to a data type, it uses a TI that
> refers to a record in the types section for the .obj.
>
>
>
> When the linker generates the .pdb for an image, it creates a separate
> symbols section in the .pdb for each contributing .obj. The contents of the
> .obj’s symbols section are copied into the corresponding section in the
> .pdb, fixing up any TIs to refer to the types section of the .pdb, and
> fixing up any code or data addresses to refer to the correct location in
> the final linked image.
>
>
>
> *Proposed Design*
>
> *How Debug Info is Generated*
>
> The CodeView type records for a compilation unit will be generated by the
> front-end for the source language (Clang, in the case of C and C++). The
> front-end has access to the full type system and AST of the language, which
> is necessary to generate accurate debug type info. The type records will be
> represented as metadata in the LLVM IR, similar to how DWARF debug info is
> represented. I’ll cover the actual representation in a bit more detail
> below.
>
> The LLVM back-end will be responsible for emitting the CodeView type
> records from the IR into the output .obj file. Since the type records will
> already be in the correct format, this is essentially just a copy. No
> inspection of the type records is necessary within LLVM. The back-end will
> also be responsible for generating CodeView symbol records, line numbers,
> and source file info for any functions and data defined in the compilation
> unit. The back-end is the logical place to do this because only the
> back-end knows the code addresses, data addresses, and stack frame layouts.
>
>
>
> *Representation of CodeView in LLVM IR*
>
> DICompileUnit
>   + *existing fields*
>   + CodeViewTypes : DICodeViewTypes
>
> DICodeViewTypes
>   + TypeRecords : MDString[]
>   + UDTSymbols : DICodeViewUDT[]
>
> DICodeViewUDT
>   + Name : MDString
>   + TypeIndex : uint32_t
>
> DIVariable
>   + *existing fields*
>   + TypeIndex : uint32_t
>
> DISubprogram
>   + *existing fields*
>   + TypeIndex : uint32_t
>
> The existing DICompileUnit node will have a new operand named
> CodeViewTypes, which points to the new DICodeViewTypes node that describes
> the CodeView type information for the compilation unit.
>
>
>
> The DICodeViewTypes node contains two operands:
>
> - TypeRecords, an array of MDStrings containing the actual CodeView type
>   records for the compilation unit, sorted in ascending order of type
>   index.
> - UDTSymbols, an array of DICodeViewUDT nodes describing the user-defined
>   types (class/struct/union/enum) for which CodeView symbol records will
>   need to be emitted by the back-end.
>
>
>
> The DICodeViewUDT node contains two operands:
>
> - Name, an MDString with the name of the symbol as it should appear in
>   the CodeView symbol record.
> - TypeIndex, a uint32_t holding the CodeView type index of the type
>   record for the user-defined type’s definition.
>
>
>
> The DICodeViewUDT nodes are necessary because they are generally the only
> references to the definition of the user-defined type. Other uses of that
> type refer to the forward declaration record for the type, and without a
> reference to the definition of the type, the linker will discard the
> definition record when it merges the type information into the PDB.
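
The reasoning above can be illustrated with a tiny reachability check in Python. As before, the (kind, name, refs) tuples are my own simplified model of type records, and the roots stand in for the type indices carried by variables and by the proposed DICodeViewUDT nodes.

```python
FIRST_TYPE_INDEX = 0x1000

def reachable(records, roots):
    """Return the set of TIs reachable from `roots` through record references."""
    seen = set()
    stack = list(roots)
    while stack:
        ti = stack.pop()
        if ti < FIRST_TYPE_INDEX or ti in seen:
            continue
        seen.add(ti)
        stack.extend(records[ti - FIRST_TYPE_INDEX][2])
    return seen

# struct Node { Node *next; };  with a global `Node *head;`
records = [
    ("struct_fwd", "Node", ()),         # 0x1000: forward declaration
    ("pointer", "", (0x1000,)),         # 0x1001: Node*
    ("struct_def", "Node", (0x1001,)),  # 0x1002: definition
]
variable_roots = [0x1001]  # type of `head`
udt_roots = [0x1002]       # DICodeViewUDT(Name="Node", TypeIndex=0x1002)

without_udt = reachable(records, variable_roots)
with_udt = reachable(records, variable_roots + udt_roots)
```

Without the UDT root, the definition at 0x1002 is unreachable and a linker keeping only referenced records would drop it; with the UDT root, all three records survive.
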
>
>
>
> To specify the CodeView type for a variable or function, the DIVariable
> and DISubprogram nodes will have an additional TypeIndex operand containing
> the type index of the type record for that variable or function’s type.
> This operand will be set to zero when CodeView debug info is not enabled.
>
>
>
> The above representation essentially extends the existing DWARF-focused
> debug metadata to also include CodeView info. This was the least invasive
> way I found to add CodeView support, but it may not be the right
> architectural decision. It would also be possible to have the CodeView
> metadata entirely separate from the DWARF metadata. This would reduce the
> size of the IR when only one form of debug information was being emitted,
> which is presumably the common case. However, I expect it would complicate
> the scenario where both DWARF and CodeView are being emitted; for example,
> would having two dbg.declare intrinsics for a single local variable confuse
> existing consumers of LLVM IR? I’m hoping someone more familiar with the
> existing debug info architecture can provide some guidance here if there’s
> a better way of doing this.
>
>
>
> *New Library - LLVMCodeView*
>
> The design introduces a new library in LLVM, “LLVMCodeView”. This library
> will contain the code to read and write the CodeView debug info format. The
> library depends only on the LLVMSupport library, enabling non-LLVM clients
> to use the library without depending on large portions of LLVM. The
> LLVMCodeView library is *not* responsible for translating other forms of
> information (e.g. LLVM IR, Clang ASTs) to the CodeView format; that work
> happens in other components.
>
>
>
> *Changes to LLVMCore*
>
> The LLVMCore library will be extended with the definitions of the new
> debug metadata nodes and new fields on existing nodes, as described
> previously.
>
>
>
> *Generating CodeView Type Records in Clang*
>
> The clangCodeGen library will be extended with a new class,
> CodeViewTypeTable. This class is the CodeView counterpart of CGDebugInfo. It
> translates Clang types into the appropriate CodeView type
> record on demand, returning the type index of the new record. This is where
> most of the interesting work happens. Since all of the type records for a
> given image are merged together by the linker when creating the final .pdb,
> having the type records emitted by Clang match those emitted by cl.exe as
> closely as possible minimizes conflicts when object files built by the two
> compilers are linked together into the same image.
>
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>