[LLVMdev] First-class debug info IR: MDLocation

Fri Oct 24 16:16:28 PDT 2014

I've attached a preliminary patch for `MDLocation` as a follow-up to the
RFC [1] last week.  It's not commit-ready -- in particular, it squashes
a bunch of commits together and doesn't pass `make check` -- but I think
it's close enough to indicate the direction and work toward consensus.

[1]: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-October/077715.html

IMO, the files to focus on are:

    include/llvm/IR/DebugInfo.h
    include/llvm/IR/DebugLoc.h
    include/llvm/IR/Metadata.h
    include/llvm/IR/Value.h
    lib/AsmParser/LLLexer.cpp
    lib/AsmParser/LLParser.cpp
    lib/AsmParser/LLParser.h
    lib/AsmParser/LLToken.h
    lib/Bitcode/Reader/BitcodeReader.cpp
    lib/Bitcode/Writer/BitcodeWriter.cpp
    lib/Bitcode/Writer/ValueEnumerator.cpp
    lib/Bitcode/Writer/ValueEnumerator.h
    lib/IR/AsmWriter.cpp
    lib/IR/AsmWriter.h
    lib/IR/DebugInfo.cpp
    lib/IR/DebugLoc.cpp
    lib/IR/LLVMContextImpl.cpp
    lib/IR/LLVMContextImpl.h
    lib/IR/Metadata.cpp

Using `Value` instead of `MDNode`
=================================

A number of APIs expect `MDNode` -- previously, the only referenceable
type of metadata -- but this patch (and the ones that will follow)
introduces referenceable metadata that *does not* inherit from `MDNode`.
Metadata APIs such as `Instruction::getMetadata()` and
`NamedMDNode::getOperand()` need to be able to return non-`MDNode`
metadata.
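
To make the API consequence concrete, here is a minimal standalone sketch.
Every class in it is a hypothetical stand-in, not the real LLVM type, and
the exact inheritance (in particular `MDUser` deriving directly from
`Value`) is an assumption made for illustration.  It uses `dynamic_cast`
only to stay self-contained; LLVM itself uses `dyn_cast<>` for this kind
of check.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the real LLVM classes.
struct Value {
  virtual ~Value() = default;
};
struct MDNode : Value {            // generic metadata node
  std::vector<const Value *> Operands;
};
struct MDUser : Value {};          // assumed base for specialized metadata
struct MDLocation : MDUser {       // referenceable, but not an MDNode
  unsigned Line = 0;
  unsigned Column = 0;
};

// The accessor returns the common base type so that it can hand back
// either kind of metadata.
struct Instruction {
  const Value *DbgAttachment = nullptr;
  const Value *getMetadata() const { return DbgAttachment; }
};

// Callers that need a specific kind now have to check for it explicitly.
inline std::string describe(const Instruction &I) {
  const Value *MD = I.getMetadata();
  if (!MD)
    return "none";
  if (const auto *Loc = dynamic_cast<const MDLocation *>(MD))
    return "MDLocation " + std::to_string(Loc->Line) + ":" +
           std::to_string(Loc->Column);
  if (dynamic_cast<const MDNode *>(MD))
    return "MDNode";
  return "other";
}
```

A caller written against an `MDNode *` return type would not compile
against this accessor, which is where the noise in the squashed patch
comes from.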

I plan to commit the API changes incrementally so we can fix any issues
there before pushing the functionality changes.  Unfortunately, this
currently adds a lot of noise to the (squashed) patch.

Introducing `MDLocation`
========================

Of course, this adds `MDLocation`, the first subclass of `MDUser`.  This
is a first-class IR type that has two other representations:
`DILocation` (which now trivially wraps `MDLocation` instead of
`MDNode`) and `DebugLoc`.

I've genericised the code in `LLParser` (and elsewhere) to sketch out
how adding other `MDUser` subclasses will go.  Perhaps I used the wrong
axis, but we can adjust it as we go.

Usage examples:

    !6 = metadata MDLocation(line: 43, column: 7, scope: !4)
    !7 = metadata MDLocation(scope: !5, line: 67, inlinedAt: !6)

The fields can be listed in any order.  The `scope:` field is required,
but the others are optional (`line:` and `column:` default to `0`,
`inlinedAt:` defaults to `null`).
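
The field rules above can be sketched as a small standalone parser.  This
is not the `LLParser` code from the patch: `ParsedLocation` and
`parseMDLocation` are hypothetical names, the sketch handles only the
`MDLocation(...)` part of a line (no `!6 = metadata` prefix), and it does
not accept an explicit `null` literal.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical result type for this sketch; not an LLVM class.
struct ParsedLocation {
  unsigned Line = 0;                  // `line:` defaults to 0
  unsigned Column = 0;                // `column:` defaults to 0
  unsigned Scope = 0;                 // `scope:` is required (N of !N)
  std::optional<unsigned> InlinedAt;  // empty models the `null` default
};

// Parses "MDLocation(field: value, ...)" with fields in any order.
// Returns std::nullopt for malformed input, an unknown or repeated
// field, or a missing `scope:`.
inline std::optional<ParsedLocation> parseMDLocation(const std::string &S) {
  const std::string Prefix = "MDLocation(";
  if (S.size() <= Prefix.size() ||
      S.compare(0, Prefix.size(), Prefix) != 0 || S.back() != ')')
    return std::nullopt;

  ParsedLocation Loc;
  bool SeenLine = false, SeenColumn = false;
  bool SeenScope = false, SeenInlinedAt = false;
  std::size_t Pos = Prefix.size();
  const std::size_t End = S.size() - 1;  // index of the closing ')'
  auto skipSpaces = [&] {
    while (Pos < End && std::isspace(static_cast<unsigned char>(S[Pos])))
      ++Pos;
  };

  for (;;) {
    skipSpaces();
    const std::size_t Colon = S.find(':', Pos);
    if (Colon == std::string::npos || Colon >= End)
      return std::nullopt;
    const std::string Name = S.substr(Pos, Colon - Pos);
    Pos = Colon + 1;
    skipSpaces();

    // A value is either a plain number or a metadata reference "!N".
    const bool IsRef = Pos < End && S[Pos] == '!';
    if (IsRef)
      ++Pos;
    const std::size_t DigitsBegin = Pos;
    unsigned Num = 0;
    while (Pos < End && std::isdigit(static_cast<unsigned char>(S[Pos])))
      Num = Num * 10 + static_cast<unsigned>(S[Pos++] - '0');
    if (Pos == DigitsBegin)
      return std::nullopt;

    if (Name == "line" && !IsRef && !SeenLine) {
      Loc.Line = Num;
      SeenLine = true;
    } else if (Name == "column" && !IsRef && !SeenColumn) {
      Loc.Column = Num;
      SeenColumn = true;
    } else if (Name == "scope" && IsRef && !SeenScope) {
      Loc.Scope = Num;
      SeenScope = true;
    } else if (Name == "inlinedAt" && IsRef && !SeenInlinedAt) {
      Loc.InlinedAt = Num;
      SeenInlinedAt = true;
    } else {
      return std::nullopt;
    }

    skipSpaces();
    if (Pos == End)
      break;
    if (S[Pos] != ',')
      return std::nullopt;
    ++Pos;
  }
  if (!SeenScope)
    return std::nullopt;
  return Loc;
}
```

Because fields are looked up by name, adding a new optional field to such
a parser would not invalidate existing input.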

(Note that in the RFC I referred to this as an `MDLineTable`, but
`MDLocation` is a better name.  If/when this work supersedes the
`DIDescriptor` hierarchy, it'll likely get renamed to `DILocation`, but
for now there's a name clash.)

Where this is heading
=====================

Let's look at a concrete example.  Here's some simple C++ code:

    $ cat t.h
    struct T { short a; short b; };
    $ cat foo.cpp
    #include "t.h"
    int foo(T t) { return t.a + t.b; }
    $ cat bar.cpp
    #include "t.h"
    int foo(T t);
    int bar(T t) { return foo(t) * 2; }

Looking forward, after refactoring ownership and uniquing and fixing up
a few schema issues, I'd expect the above to link into something like
the following:

    !0 = metadata DIFile(filename: "foo.cpp", directory: "/path/to")
    !1 = metadata DIFile(filename: "./t.h", directory: "/path/to")
    !2 = metadata DIFile(filename: "bar.cpp", directory: "/path/to")
    !3 = metadata DIBaseType(name: "short", size: 16, align: 16)
    !5 = metadata DIBaseType(name: "int", size: 32, align: 32)
    !6 = metadata DICompositeType(tag: 0x13, name: "T", uniqued: "_ZTS1T",
                                  file: !1, line: 1, size: 32, align: 16)
    !7 = metadata DIMember(line: 1, file: !1, type: !3,
                           name: "a", size: 16, align: 16, context: !6)
    !8 = metadata DIMember(line: 1, file: !1, type: !3,
                           name: "b", size: 16, align: 16, context: !6)
    !9 = metadata DISubroutineType(args: [ !5, !6 ])
    !10 = metadata DICompileUnit(file: !0, language: 4, kind: FullDebug,
                                 producer: "clang version 3.6.0 ",
                                 retainedUniqueTypes: [ !6 ])
    !11 = metadata DISubprogram(name: "foo", linkageName: "_Z3foo1T",
                                handle: i32 (i32)* @_Z3foo1T, file: !0,
                                type: !9, context: !10)
    !12 = metadata DIArgVariable(name: "t", arg: 1, line: 2, type: !6,
                                 context: !11)
    !13 = metadata DILocation(line: 2, column: 11, scope: !11)
    !14 = metadata DILocation(line: 2, column: 16, scope: !11)
    !15 = metadata DICompileUnit(file: !2, language: 4, kind: FullDebug,
                                 producer: "clang version 3.6.0 ",
                                 retainedUniqueTypes: [ !6 ])
    !16 = metadata DISubprogram(name: "bar", linkageName: "_Z3bar1T",
                                handle: i32 (i32)* @_Z3bar1T, file: !2,
                                type: !9, context: !15)
    !17 = metadata DIArgVariable(name: "t", arg: 1, line: 3, type: !6,
                                 context: !16)
    !18 = metadata DILocation(line: 3, column: 11, scope: !16)
    !19 = metadata DILocation(line: 3, column: 23, scope: !16)

Notice that only the links to parents (i.e., `context:`) are explicit
here -- backlinks are implied.  For example, !7 and !8 point to !6, but
not the reverse.

This has the interesting property of removing all cycles from
serialization (assembly and bitcode).
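
Implied backlinks can be recomputed from the parent links whenever they
are needed.  The following is a minimal sketch under assumptions: `Node`
and `buildMembers` are hypothetical names, nodes are reduced to their IDs,
and an ID of 0 stands for "no context".

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical flattened node: its own ID plus the ID of its context
// (parent).  A context of 0 means the node is at the top level.
struct Node {
  unsigned Id;
  unsigned Context;
};

// Builds the parent -> children index that the serialized form leaves
// implicit.  Children keep the order in which they were listed.
inline std::map<unsigned, std::vector<unsigned>>
buildMembers(const std::vector<Node> &Nodes) {
  std::map<unsigned, std::vector<unsigned>> Members;
  for (const Node &N : Nodes)
    if (N.Context != 0)
      Members[N.Context].push_back(N.Id);
  return Members;
}
```

Feeding it `{6, 0}, {7, 6}, {8, 6}` yields an index in which !6 lists !7
and !8 as members, without the serialized form ever storing that edge.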

Making debug info assembly readable and writable
================================================

Moreover, we're now in a place where it's trivial to express the
"context" pointer structurally.  Here's the same debug info as above,
using syntactic sugar to fill the "context" pointers:

    !0 = metadata DIFile(filename: "foo.cpp", directory: "/path/to")
    !1 = metadata DIFile(filename: "./t.h", directory: "/path/to")
    !2 = metadata DIFile(filename: "bar.cpp", directory: "/path/to")
    !3 = metadata DIBaseType(name: "short", size: 16, align: 16)
    !5 = metadata DIBaseType(name: "int", size: 32, align: 32)
    !6 = metadata DICompositeType(tag: 0x13, name: "T", uniqued: "_ZTS1T",
                                  file: !1, line: 1, size: 32, align: 16) {
      !7 = metadata DIMember(line: 1, file: !1, type: !3,
                             name: "a", size: 16, align: 16)
      !8 = metadata DIMember(line: 1, file: !1, type: !3,
                             name: "b", size: 16, align: 16)
    } ; !6
    !9 = metadata DISubroutineType(args: [ !5, !6 ])
    !10 = metadata DICompileUnit(file: !0, language: 4, kind: FullDebug,
                                 producer: "clang version 3.6.0 ",
                                 retainedUniqueTypes: [ !6 ]) {
      !11 = metadata DISubprogram(name: "foo", linkageName: "_Z3foo1T",
                                  handle: i32 (i32)* @_Z3foo1T, file: !0,
                                  type: !9) {
        !12 = metadata DIArgVariable(name: "t", arg: 1, line: 2, type: !6)
        !13 = metadata DILocation(line: 2, column: 11)
        !14 = metadata DILocation(line: 2, column: 16)
      } ; !11
    } ; !10
    !15 = metadata DICompileUnit(file: !2, language: 4, kind: FullDebug,
                                 producer: "clang version 3.6.0 ",
                                 retainedUniqueTypes: [ !6 ]) {
      !16 = metadata DISubprogram(name: "bar", linkageName: "_Z3bar1T",
                                  handle: i32 (i32)* @_Z3bar1T, file: !2,
                                  type: !9) {
        !17 = metadata DIArgVariable(name: "t", arg: 1, line: 3, type: !6)
        !18 = metadata DILocation(line: 3, column: 11)
        !19 = metadata DILocation(line: 3, column: 23)
      } ; !16
    } ; !15

This assembly has the following advantages over the status quo:

  - Fields are named.  Aside from readability, this prevents
    adding/reordering fields in the schema from requiring testcase
    updates.

  - Serialization graph becomes a DAG.  Aside from readability, this
    removes most RAUW from assembly (and all RAUW from bitcode).

  - Structure is clear.
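
As a rough illustration of how mechanical the flat-to-hierarchical
rewrite is, here is a standalone sketch of a printer.  It is not code from
the patch: `Entry`, `emitEntry` and `printHierarchical` are hypothetical
names, each node body is treated as an opaque string with its context
field already stripped, and a context of 0 means "top level".

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical flat record for this sketch.
struct Entry {
  unsigned Id;
  std::string Body;   // e.g. "DIMember(name: \"a\")", context omitted
  unsigned Context;   // ID of the parent, or 0 for a top-level node
};

using ChildIndex = std::map<unsigned, std::vector<const Entry *>>;

// Prints one node, followed by its children in a braced, indented block.
inline void emitEntry(const Entry &E, const ChildIndex &Kids, unsigned Depth,
                      std::string &Out) {
  const std::string Indent(2 * Depth, ' ');
  const std::string Name = "!" + std::to_string(E.Id);
  Out += Indent + Name + " = metadata " + E.Body;
  const auto It = Kids.find(E.Id);
  if (It == Kids.end()) {  // leaf: no braces
    Out += "\n";
    return;
  }
  Out += " {\n";
  for (const Entry *Kid : It->second)
    emitEntry(*Kid, Kids, Depth + 1, Out);
  Out += Indent + "} ; " + Name + "\n";
}

// Groups nodes under their context and prints top-level nodes in the
// order they appear in the flat list.
inline std::string printHierarchical(const std::vector<Entry> &Flat) {
  ChildIndex Kids;
  for (const Entry &E : Flat)
    if (E.Context != 0)
      Kids[E.Context].push_back(&E);
  std::string Out;
  for (const Entry &E : Flat)
    if (E.Context == 0)
      emitEntry(E, Kids, 0, Out);
  return Out;
}
```

The reverse direction (reading the nested form and filling in `context:`)
only needs a stack of enclosing IDs, which is why either form could be
the default.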

Bike sheds to paint
===================

 1. Should we trim some boilerplate?  E.g., it would be trivial to
    change:

        !6 = metadata MDLocation(line: 43, column: 7, scope: !4)

    to:

        !6 = MDLocation(line: 43, column: 7, scope: !4)

    This would not complicate `LLParser`.  Thoughts?

 2. Which of the two "end goal" syntaxes is better: flat, or
    hierarchical?  Better for what?  Why?

    The flat one might be better for FileCheck-ing (not sure), but IMO
    the hierarchical one is much saner for us humans, and that's the
    main point of assembly.  It wouldn't be hard to default to one and
    write the other based on a command-line flag -- is that a good idea?

 3. Assembly syntax is pretty easy to change, so this doesn't have to be
    perfect now.  Nevertheless, is there a magical syntax that would be
    easier to read/write/FileCheck?

Attachment: MDLocation-preview.patch (application/octet-stream, 608893 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141024/2296c85c/attachment.obj>