[llvm-dev] [RFC] llvm-diva - Debug Information Visual Analyzer

Enciso, Carlos via llvm-dev llvm-dev at lists.llvm.org
Sun Aug 9 21:50:53 PDT 2020


llvm-diva - Debug Information Visual Analyzer
Carlos Alberto Enciso, Sony Interactive Entertainment

LLVM supports multiple debug information formats (namely DWARF and CodeView)
in different binary formats (e.g. ELF, PDB, Mach-O). Understanding the mappings
between source code and debug information can be complex, and it's a problem
we've commonly encountered when triaging debug information issues.

The output from tools such as llvm-dwarfdump or llvm-readobj closely follows
the internal debug information format, and in our experience a good knowledge
of those formats is required to understand it, which limits who can triage and
address such issues quickly. Even for experts, triaging issues can sometimes
take a lot of time and effort due to the inherent complexity.

=========
llvm-diva
=========

At Sony, we've been developing an LLVM-based debug information analysis tool
which we've called llvm-diva (short for LLVM debug information visual analyzer),
designed to visualize these mappings. It's based entirely on the existing LLVM
libraries for debug info parsing, target support, etc., and at this stage we
believe that it has proven its worth internally to the point where we would like
to propose upstreaming it as part of the mainline LLVM project alongside
existing tools such as llvm-dwarfdump.

llvm-diva is a command line tool that processes the debug info contained in a
binary file and produces a format-agnostic "Logical View": a high-level
semantic representation of the debug info, independent of the low-level
format.

The logical view is composed of the traditional programming elements: scopes,
types, symbols and lines. These elements can display additional information,
such as variable coverage factor, lexical block level, disassembly code, code
ranges, etc.

The variety of llvm-diva command line options enables the creation of very
rich logical views that include more low-level debug information: disassembly
code associated with the debug lines, variables' runtime locations and
coverage, internal offsets for the elements within the binary file, etc.

With llvm-diva, we aim to answer the following questions:

* Which variables are dropped due to optimization?

* Why can't I stop at a particular line?

* Which lines are associated with a specific code range?

* Does the debug information represent the original source?

* What is the semantic difference between the debug info generated by different
  toolchain versions?

=============
Printing Mode
=============

In this mode llvm-diva prints the logical view or portions of it, based on
criteria patterns (including regular expressions) to select the kind of logical
elements to be included in the output.

The example below is used to show the different kinds of output generated by
llvm-diva. It was compiled for an x86 ELF target with a recent version of clang
(-O0 -g):

1  using INTPTR = const int *;
2  int foo(INTPTR ParamPtr, unsigned ParamUnsigned, bool ParamBool) {
3    if (ParamBool) {
4      typedef int INTEGER;
5      const INTEGER CONSTANT = 7;
6      return CONSTANT;
7    }
8    return ParamUnsigned;
9  }

Print basic details
-------------------

The following command prints basic details for all the logical elements, sorted
by the debug information internal offset, and includes their lexical level.
Each row represents an element that is present in the debug information. The
first column is the scope level, followed by the associated line number (if
any), and finally the description of the element.

llvm-diva --sort=offset
          --attribute=level
          --print=scopes,symbols,types,lines
          test.o

Logical View:

[000]           {File} 'test.o'
[001]             {CompileUnit} 'test.cpp'
[002]     2         {Function} extern not_inlined 'foo' -> 'int'
[003]     2           {Parameter} 'ParamPtr' -> 'INTPTR'
[003]     2           {Parameter} 'ParamUnsigned' -> 'unsigned int'
[003]     2           {Parameter} 'ParamBool' -> 'bool'
[003]                 {Block}
[004]     5             {Variable} 'CONSTANT' -> 'const INTEGER'
[004]     5             {Line}
[004]     6             {Line}
[003]     4           {TypeAlias} 'INTEGER' -> 'int'
[003]     2           {Line}
[003]     3           {Line}
[003]     8           {Line}
[003]     8           {Line}
[003]     9           {Line}
[002]     1         {TypeAlias} 'INTPTR' -> '* const int'
[002]     9         {Line}

Looking at the output we can see that it shows the semantics of the debug
information but decoupled from the underlying DWARF representation.

On closer inspection, we can see what could be a potential debug issue:

[003]                 {Block}
[003]     4           {TypeAlias} 'INTEGER' -> 'int'

The 'INTEGER' definition is at level [003], the same lexical scope as the
anonymous {Block} ('true' branch for the 'if' statement) whereas in the
original source code the typedef statement is clearly inside that block, so the
'INTEGER' definition should also be at level [004] inside the block.
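
The row layout described above (level, optional line number, element kind,
description) is regular enough to be post-processed by a short script. The
following is a minimal sketch in Python, assuming only the textual layout shown
in this email; Row and parse_row are hypothetical helper names and are not part
of llvm-diva.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    level: int            # lexical level, e.g. 3 for "[003]"
    line: Optional[int]   # source line, or None when the row has none
    kind: str             # element kind, e.g. "TypeAlias"
    text: str             # remaining description


# Optional leading +/- covers the rows printed by the comparison reports.
ROW_RE = re.compile(r"^\s*[+-]?\[(\d+)\]\s+(?:(\d+)\s+)?\{(\w+)\}\s*(.*)$")


def parse_row(row: str) -> Optional[Row]:
    """Parse one logical view row; return None for non-element lines."""
    m = ROW_RE.match(row)
    if m is None:
        return None
    level, line, kind, text = m.groups()
    return Row(int(level), int(line) if line else None, kind, text.strip())


block = parse_row("[003]                 {Block}")
alias = parse_row("[003]     4           {TypeAlias} 'INTEGER' -> 'int'")
# Both rows report the same level, which is the issue discussed above.
print(block.level == alias.level)
```

With rows in this form, a check such as "a typedef declared inside a block
should be one level deeper than the block" becomes a comparison of two integers.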

Select logical elements
-----------------------

This feature allows selecting specific logical elements; the patterns used as
criteria can include regular expressions. The output layout is controlled by
the '--report' option, which produces a tabular report, a tree view showing the
parent hierarchy for the logical elements that match the criteria, or just a
summary with the number of occurrences.

The following prints all symbols and types that contain 'inte' in their names
or types, using a tabular layout and giving the number of matches.

llvm-diva --select-nocase --select-regex --report=details,summary
          --select=INTe
          --attribute=level --print=symbols,types,instructions
          test.o

Logical View:

[000]           {File} 'test.o'
[003]     4     {TypeAlias} 'INTEGER' -> 'int'
[004]     5     {Variable} 'CONSTANT' -> 'const INTEGER'

-----------------------------
Element      Total      Found
-----------------------------
Scopes           4          0
Symbols          4          1
Types            2          1
Lines           16          0
-----------------------------
Total           26          2
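
The selection above can be mimicked outside the tool to check one's reading of
the pattern. This is only a sketch: it assumes that a pattern is tested against
both the name and the type of each element, which is what the description above
suggests, and select is a hypothetical helper rather than llvm-diva code.

```python
import re

# (kind, name, type) triples transcribed from the first logical view above.
ELEMENTS = [
    ("Parameter", "ParamPtr", "INTPTR"),
    ("Parameter", "ParamUnsigned", "unsigned int"),
    ("Parameter", "ParamBool", "bool"),
    ("Variable", "CONSTANT", "const INTEGER"),
    ("TypeAlias", "INTEGER", "int"),
    ("TypeAlias", "INTPTR", "* const int"),
]


def select(elements, pattern, nocase=False):
    """Keep the elements whose name or type matches the regular expression."""
    rx = re.compile(pattern, re.IGNORECASE if nocase else 0)
    return [e for e in elements if rx.search(e[1]) or rx.search(e[2])]


found = select(ELEMENTS, "INTe", nocase=True)
for kind, name, type_name in found:
    print(f"{{{kind}}} '{name}' -> '{type_name}'")
```

Under that assumption the sketch finds one symbol ('CONSTANT') and one type
('INTEGER'), which agrees with the "Found" column of the summary.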

===============
Comparison Mode
===============

In this mode llvm-diva compares logical views to produce a report with the
logical elements that are missing or added. We've found this a very powerful
aid in finding semantic differences in the debug information produced by
different toolchain versions or even completely different toolchains altogether
(for example, a compiler producing DWARF can be directly compared against a
completely different compiler that produces CodeView).

There are 2 comparison methods: logical view and logical elements. The first
one compares the logical view as a whole unit; for a match, each compared
logical element must have the same parents and children. The second one
compares individual logical elements without considering if their parents are
the same. For both comparison methods, the equality criteria include the name,
source code location, type and lexical scope level.
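
The "logical elements" method can be sketched as a multiset difference over
records that carry the equality criteria listed above. This is an illustration
under stated assumptions, not the llvm-diva implementation: Element and
compare_elements are hypothetical names, and the second list stands in for an
object file produced by another compiler that places the typedef one level
deeper.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Element(NamedTuple):
    kind: str
    name: str
    line: Optional[int]
    type: str
    level: int


def compare_elements(reference, target):
    """Return (missing, added), ignoring the parents of each element."""
    ref, tgt = Counter(reference), Counter(target)
    missing = list((ref - tgt).elements())  # in the reference only
    added = list((tgt - ref).elements())    # in the target only
    return missing, added


reference = [
    Element("TypeAlias", "INTEGER", 4, "int", 3),
    Element("Variable", "CONSTANT", 5, "const INTEGER", 4),
]
target = [
    Element("TypeAlias", "INTEGER", 4, "int", 4),
    Element("Variable", "CONSTANT", 5, "const INTEGER", 4),
]

missing, added = compare_elements(reference, target)
for e in missing:
    print(f"-[{e.level:03}] {e.line} {{{e.kind}}} '{e.name}' -> '{e.type}'")
for e in added:
    print(f"+[{e.level:03}] {e.line} {{{e.kind}}} '{e.name}' -> '{e.type}'")
```

Because the lexical level is part of the record, the same typedef at two
different levels is reported once as missing and once as added.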

Returning to our previous example, we found the debug information issue
described above (the invalid scope location for 'typedef int INTEGER') by
comparing against another compiler.

1  using INTPTR = const int *;
2  int foo(INTPTR ParamPtr, unsigned ParamUnsigned, bool ParamBool) {
3    if (ParamBool) {
4      typedef int INTEGER;
5      const INTEGER CONSTANT = 7;
6      return CONSTANT;
7    }
8    return ParamUnsigned;
9  }

Using GCC to generate test-gcc.o, we can apply a selection pattern with the
printing mode to obtain the following output.

llvm-diva --select-regex --select-nocase --report=details
          --select=INTe
          --attribute=level
          --print=symbols,types
          test.o test-gcc.o

Logical View:
[000]           {File} 'test.o'
[003]     4     {TypeAlias} 'INTEGER' -> 'int'
[004]     5     {Variable} 'CONSTANT' -> 'const INTEGER'

Logical View:
[000]           {File} 'test-gcc.o'
[004]     4     {TypeAlias} 'INTEGER' -> 'int'
[004]     5     {Variable} 'CONSTANT' -> 'const INTEGER'

The output shows that both objects contain the same elements, but the
'typedef INTEGER' is located at a different scope level. The GCC-generated
object shows '4', which is the correct value.

Note that there is no requirement that GCC must produce identical or similar
DWARF to clang in this case to allow the comparison. We're only comparing the
semantics.

Using the llvm-diva comparison functionality, that issue can be seen in a more
global context, which can include the logical view.

llvm-diva --compare=types --report=details,summary
          --attribute=level
          --print=symbols,types
          test.o test-gcc.o

Reference: 'test.o'
Target:    'test-gcc.o'

(1) Missing Types:
-[003]     4     {TypeAlias} 'INTEGER' -> 'int'

(1) Added Types:
+[004]     4     {TypeAlias} 'INTEGER' -> 'int'

----------------------------------------
Element   Expected    Missing      Added
----------------------------------------
Scopes           4          0          0
Symbols          0          0          0
Types            2          1          1
Lines            0          0          0
----------------------------------------
Total            6          1          1

The output shows, in tabular form, the missing (-) and added (+) elements. More
context can be obtained by swapping the reference and target object files.

llvm-diva --compare=types --report=view
          --attribute=level
          --print=symbols,types
          test.o test-gcc.o

Reference: 'test.o'
Target:    'test-gcc.o'

Logical View:
[000]           {File} 'test.o'
[001]             {CompileUnit} 'test.cpp'
[002]     1         {TypeAlias} 'INTPTR' -> '* const int'
[002]     2         {Function} extern not_inlined 'foo' -> 'int'
[003]                 {Block}
[004]     5             {Variable} 'CONSTANT' -> 'const INTEGER'
+[004]     4             {TypeAlias} 'INTEGER' -> 'int'
[003]     2           {Parameter} 'ParamBool' -> 'bool'
[003]     2           {Parameter} 'ParamPtr' -> 'INTPTR'
[003]     2           {Parameter} 'ParamUnsigned' -> 'unsigned int'
-[003]     4           {TypeAlias} 'INTEGER' -> 'int'

The output shows the merged view (reference and target), with the missing and
added elements marked.

Comparing toolchains
--------------------

In the previous section, we compared GCC and Clang. The current implementation
of llvm-diva has sufficient support for the CodeView format to make a
comparison between the MSVC and Clang compilers possible.

-----------------------------------------------------------------------
pr_44884.cpp
-----------------------------------------------------------------------
1  int bar(float Input) { return (int)Input; }
2
3  unsigned foo(char Param) {
4    typedef int INT;                      // ** Definition for INT **
5    INT Value = Param;
6    {
7      typedef float FLOAT;                // ** Definition for FLOAT **
8      {
9        FLOAT Added = Value + Param;
10        Value = bar(Added);
11      }
12    }
13    return Value + Param;
14  }

The above test (from PR44884) illustrates a scope issue found in the Clang
compiler.

See: https://bugs.llvm.org/show_bug.cgi?id=44884

Lines 4 and 7 contain two typedefs, defined at different lexical scopes:

4    typedef int INT;
7      typedef float FLOAT;

These are the logical views that llvm-diva generates for 3 different compilers
(MSVC, Clang and GCC), emitting different debug info formats (CodeView, DWARF)
on different platforms.

-----------------------------------------------------------------------
pr_44884_dw.o - Compiled with Clang (DWARF format).
-----------------------------------------------------------------------
Logical View:
[000]           {File} 'pr_44884_dw.o' -> elf64-x86-64
[001]             {CompileUnit} 'pr_44884.cpp'
[002]               {Producer} 'clang version 11.0.0
[002]     7         {Function} extern not_inlined 'bar' -> 'int'
[003]     7           {Parameter} 'Input' -> 'float'
[002]     9         {Function} extern not_inlined 'foo' -> 'unsigned int'
[003]                 {Block}
[004]    15             {Variable} 'Added' -> 'FLOAT'
[003]     9           {Parameter} 'Param' -> 'char'
[003]    13           {TypeAlias} 'FLOAT' -> 'float'
[003]    10           {TypeAlias} 'INT' -> 'int'
[003]    11           {Variable} 'Value' -> 'INT'

-----------------------------------------------------------------------
pr_44884_gc.o - Compiled with GCC (DWARF Format).
-----------------------------------------------------------------------
Logical View:
[000]           {File} 'pr_44884_gc.o' -> elf64-x86-64
[001]             {CompileUnit} 'pr_44884.cpp'
[002]               {Producer} 'GNU C++ 5.5.0 20171010'
[002]     7         {Function} extern not_inlined 'bar' -> 'int'
[003]     7           {Parameter} 'Input' -> 'float'
[002]     9         {Function} extern not_inlined 'foo' -> 'unsigned int'
[003]                 {Block}
[004]                   {Block}
[005]    15               {Variable} 'Added' -> 'FLOAT'
[004]    13             {TypeAlias} 'FLOAT' -> 'float'
[003]     9           {Parameter} 'Param' -> 'char'
[003]    10           {TypeAlias} 'INT' -> 'int'
[003]    11           {Variable} 'Value' -> 'INT'

-----------------------------------------------------------------------
pr_44884_cv.o - Compiled with Clang (CodeView format).
-----------------------------------------------------------------------
Logical View:
[000]           {File} 'pr_44884_cv.o' -> COFF-x86-64
[001]             {CompileUnit} 'pr_44884.cpp'
[002]               {Producer} 'clang version 11.0.0
[002]               {Function} extern not_inlined 'bar' -> 'int'
[003]                 {Parameter} 'Input' -> 'float'
[002]               {Function} extern not_inlined 'foo' -> 'unsigned'
[003]                 {Block}
[004]                   {Variable} 'Added' -> 'float'
[003]                 {Parameter} 'Param' -> 'char'
[003]                 {TypeAlias} 'FLOAT' -> 'float'
[003]                 {TypeAlias} 'INT' -> 'int'
[003]                 {Variable} 'Value' -> 'int'

-----------------------------------------------------------------------
pr_44884_ms.o - Compiled with MSVC (CodeView Format).
-----------------------------------------------------------------------
Logical View:
[000]           {File} 'pr_44884_ms.o' -> COFF-i386
[001]             {CompileUnit} 'pr_44884.cpp'
[002]               {Producer} 'Microsoft (R) Optimizing Compiler'
[002]               {Function} extern not_inlined 'bar' -> 'int'
[003]                 {Parameter} 'Input' -> 'float'
[002]               {Function} extern not_inlined 'foo' -> 'unsigned'
[003]                 {Block}
[004]                   {Block}
[005]                     {Variable} 'Added' -> 'float'
[004]                   {TypeAlias} 'FLOAT' -> 'float'
[003]                 {Parameter} 'Param' -> 'char'
[003]                 {TypeAlias} 'INT' -> 'int'
[003]                 {Variable} 'Value' -> 'int'

From the previous logical views, we can see that the Clang compiler emits both
typedefs at the same lexical scope (3), which is wrong, while GCC and MSVC emit
the correct lexical scope for both typedefs.

---------+----------+----------------------------------------------------------
Compiler | Format   | Lexical Scope
---------+----------+----------------------------------------------------------
Clang    | DWARF    | [003]    13           {TypeAlias} 'FLOAT' -> 'float'
         |          | [003]    10           {TypeAlias} 'INT' -> 'int'
---------+----------+----------------------------------------------------------
GCC      | DWARF    | [004]    13             {TypeAlias} 'FLOAT' -> 'float'
         |          | [003]    10           {TypeAlias} 'INT' -> 'int'
---------+----------+----------------------------------------------------------
Clang    | CodeView | [003]                 {TypeAlias} 'FLOAT' -> 'float'
         |          | [003]                 {TypeAlias} 'INT' -> 'int'
---------+----------+----------------------------------------------------------
MSVC     | CodeView | [004]                   {TypeAlias} 'FLOAT' -> 'float'
         |          | [003]                 {TypeAlias} 'INT' -> 'int'
---------+----------+----------------------------------------------------------

Note: One of the main limitations while processing CodeView debug info is the
reduced line information emitted for types and symbols. This makes it difficult
to use the comparison feature within llvm-diva, as line numbers are one of the
criteria for a logical element match. In the meantime, any graphical comparison
tool can be used to compare the textual logical views and show the differences.

The above table shows the omitted line numbers for the referenced typedefs.
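
Since the CodeView rows carry no line numbers, a check done outside the tool
has to match elements by name alone. The sketch below does that for the table
above. The dictionary is transcribed from the table; disagreements is a
hypothetical helper, and matching by name only is an assumption made for this
illustration, not how llvm-diva compares elements.

```python
# Lexical level of each typedef, keyed by (compiler, debug info format).
LEVELS = {
    ("Clang", "DWARF"): {"FLOAT": 3, "INT": 3},
    ("GCC", "DWARF"): {"FLOAT": 4, "INT": 3},
    ("Clang", "CodeView"): {"FLOAT": 3, "INT": 3},
    ("MSVC", "CodeView"): {"FLOAT": 4, "INT": 3},
}


def disagreements(levels):
    """Return the names whose lexical level differs between producers."""
    by_name = {}
    for producer, aliases in levels.items():
        for name, level in aliases.items():
            by_name.setdefault(name, {})[producer] = level
    return {
        name: per_producer
        for name, per_producer in by_name.items()
        if len(set(per_producer.values())) > 1
    }


for name, per_producer in disagreements(LEVELS).items():
    for (compiler, fmt), level in per_producer.items():
        print(f"{name}: {compiler}/{fmt} -> [{level:03}]")
```

Only 'FLOAT' is reported: both Clang outputs place it at level 3, while GCC and
MSVC place it at level 4.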

==============
Current Status
==============

Generates complete logical views for DWARF including:
- Scopes, symbols, types, lines.
- Variable location, coverage and location gaps.
- Disassembly of text sections associated with .debug_line records.
- Emission of warnings for invalid ranges and for lines with line number zero.
- Comparison: logical views and elements.

Generates partial logical views for COFF/CodeView (objects and PDB), including:
- Scopes, symbols, types, lines.
- Comparison: logical views and elements.

During the development of llvm-diva, we have found the following LLVM debug
issues:

- PR43860 - COFF Debug info shows variable at the wrong lexical scope
- PR43905 - COFF Debug info missing nested enumeration
- PR44884 - Debug information shows incorrect lexical scope for typedef
- PR46361 - [CodeView] Omitted class member function declaration for lambda
- PR46394 - [CodeView] Missing LF_NESTTYPE with nested templates

==============
Work remaining
==============

The following are the main tasks that need to be finished:
- Logical View in JSON format. Currently it uses free form text style.
- COFF/CodeView disassembly text sections.
- COFF/CodeView ranges and locations.

==========
Conclusion
==========

The source code has been uploaded for review on Phabricator at this link:

https://reviews.llvm.org/Dxxxx.

The review covers two patches:

A first patch with an IntervalTree data structure implementation, which is
required by llvm-diva.
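
This email does not describe the interface of the IntervalTree patch. For
readers unfamiliar with the general idea, an interval tree answers the question
"which intervals contain this point?". The sketch below is a textbook centered
interval tree in Python. It is not the data structure from the patch, and the
address-range use case and the names in it are assumptions made for
illustration only.

```python
class IntervalTree:
    """Centered interval tree over closed intervals (lo, hi, value)."""

    def __init__(self, intervals):
        self.center = None
        self.left = self.right = None
        self.by_lo, self.by_hi = [], []
        if not intervals:
            return
        # Use a median endpoint as the center, so at least one interval
        # is stored at this node and both subtrees are strictly smaller.
        points = sorted(p for lo, hi, _ in intervals for p in (lo, hi))
        self.center = points[len(points) // 2]
        here = [i for i in intervals if i[0] <= self.center <= i[1]]
        left = [i for i in intervals if i[1] < self.center]
        right = [i for i in intervals if i[0] > self.center]
        self.by_lo = sorted(here, key=lambda i: i[0])
        self.by_hi = sorted(here, key=lambda i: i[1], reverse=True)
        if left:
            self.left = IntervalTree(left)
        if right:
            self.right = IntervalTree(right)

    def query(self, point):
        """Return the values of all intervals that contain point."""
        if self.center is None:
            return []
        if point == self.center:
            return [i[2] for i in self.by_lo]
        found = []
        if point < self.center:
            for lo, _, value in self.by_lo:   # ascending lower bounds
                if lo > point:
                    break
                found.append(value)
            child = self.left
        else:
            for _, hi, value in self.by_hi:   # descending upper bounds
                if hi < point:
                    break
                found.append(value)
            child = self.right
        if child is not None:
            found += child.query(point)
        return found


# Hypothetical address ranges for two functions and a nested block.
tree = IntervalTree([
    (0x00, 0x40, "foo"),
    (0x10, 0x20, "block"),
    (0x50, 0x60, "bar"),
])
print(sorted(tree.query(0x18)))
```

A query for an address inside the nested block returns both the block and the
enclosing function, which is the kind of lookup needed when mapping a code
address back to its enclosing scopes.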

A second patch with the actual tool (in llvm/tools/llvm-diva).

Once these first two patches are committed, the plan is to keep working on
llvm-diva with the help of the community to address current limitations and
find good solutions/fixes for any design issues.

We hope the community will find llvm-diva as useful as we have.

Special thanks to Orlando Cazalet-Hyams for testing the tool and to Greg Bedwell,
Phillip Power and Paul Robinson for suggesting improvements and reviewing the
tool documentation.

Thanks for your time.

-Carlos


