[llvm-dev] [DWARFv5] Reading the .debug_str_offsets section

Wed Jul 5 13:34:30 PDT 2017

There was some discussion about this in D34765, and I had a follow-up
chat with Wolfgang separately, plus spent a fair amount of time
reading and thinking today.  I thought I would write down my
understanding here so we can reach a common understanding of how it
ought to work, and therefore what our code should do.

For any non-DWARF-experts who might be interested, in principle this
section is straightforward: It's an array of offsets into .debug_str,
which in turn is a standard object-file string section.  The idea
behind .debug_str_offsets is that string references from other parts
of the DWARF can use an index into the array, instead of a direct
reference to the string section.  This means we end up with only one
object-file relocation per string, rather than one per reference.
Fewer relocations = smaller object files and faster link times.
To the extent that strings are referenced more than once, we win.
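
To make the indirection concrete, here is a minimal sketch of a lookup
by index.  It assumes little-endian data and a bare offsets array with
no header; the function names are made up for illustration and are not
LLVM APIs.

```python
import struct


def read_cstring(debug_str: bytes, offset: int) -> str:
    """Read the NUL-terminated string that starts at `offset` in .debug_str."""
    end = debug_str.index(b"\x00", offset)
    return debug_str[offset:end].decode("utf-8")


def lookup_by_index(offsets_array: bytes, debug_str: bytes,
                    index: int, entry_size: int = 4) -> str:
    """Resolve a string index: index -> offset in .debug_str -> string.

    Assumption: little-endian entries, 4 or 8 bytes wide.
    """
    fmt = "<I" if entry_size == 4 else "<Q"
    (str_offset,) = struct.unpack_from(fmt, offsets_array, index * entry_size)
    return read_cstring(debug_str, str_offset)


# Toy data: three strings and an array holding their offsets.
debug_str = b"main\x00argc\x00argv\x00"
offsets_array = struct.pack("<3I", 0, 5, 10)

print(lookup_by_index(offsets_array, debug_str, 1))
```

Only the entries of the offsets array need relocating; every reference
from elsewhere in the DWARF is a plain index.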

The devil is in the details.  There are three distinct interesting
cases, when we are talking about the standard DWARF layout of this
section, and then some wrinkles added by the GCC split-DWARF style
which does not use exactly the same layout.  [And then the DWARF
committee failed to use a different section name.  Our bad.]

First the three cases for standard DWARF.  These are the "normal" (or
relocatable/executable) case, the "split" case, and the "package" case.

(a) For a .o file, the compiler produces a .debug_str_offsets section
which has one or more "contributions" in it.  Each contribution has a
header, which gives its size and whether the array elements are 32 or 
64 bits wide.  Any DWARF compile-unit or type-unit that uses the array
(that is, any unit that uses any of the "strx" forms) has a 
DW_AT_str_offsets_base attribute that points to the 0th element of the
array.  The producer chooses whether to have one contribution shared by
all units, one contribution per unit, or somewhere in between.

There's an implication for how to read the .debug_str_offsets section,
which is that the reader has to parse the section before using it to
look up any strings.  "Parsing" here really means just following the
sequence of contribution-headers to determine what element size to
associate with each contribution.
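
A rough sketch of that walk, for illustration only.  It assumes the
DWARF v5 header layout as I read the spec (an initial length, then a
2-byte version and 2 bytes of padding), little-endian data, and
contributions packed back to back with nothing in between.  The
Contribution type and function name are hypothetical.

```python
import struct
from typing import List, NamedTuple


class Contribution(NamedTuple):
    header_offset: int  # where the header starts in the section
    base: int           # offset of the 0th element (DW_AT_str_offsets_base)
    entry_size: int     # 4 for 32-bit DWARF, 8 for 64-bit DWARF
    count: int          # number of offsets in the array
    version: int


def parse_contributions(data: bytes) -> List[Contribution]:
    """Follow the chain of headers through a standard .debug_str_offsets."""
    result = []
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        if length == 0xFFFFFFFF:
            # 64-bit DWARF: escape marker followed by an 8-byte length.
            (length,) = struct.unpack_from("<Q", data, pos + 4)
            length_field_size, entry_size = 12, 8
        else:
            length_field_size, entry_size = 4, 4
        end = pos + length_field_size + length
        if length < 4 or end > len(data):
            raise ValueError(f"implausible contribution length at {pos:#x}")
        version, _padding = struct.unpack_from("<HH", data,
                                               pos + length_field_size)
        base = pos + length_field_size + 4  # skip version + padding
        count = (end - base) // entry_size
        result.append(Contribution(pos, base, entry_size, count, version))
        pos = end
    return result


# One 32-bit contribution with two entries, then a 64-bit one with one entry.
section = (struct.pack("<IHH2I", 12, 5, 0, 0, 5)
           + struct.pack("<IQHHQ", 0xFFFFFFFF, 12, 5, 0, 10))
for c in parse_contributions(section):
    print(c)
```

A reader can then map any DW_AT_str_offsets_base value to the
contribution whose base matches it, and so learn the element size.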

[I have previously described the layout differently, and I think that
was wrong.  Specifically there is no "array slicing" or other disjoint 
sharing of contributions across units.  Thanks to Wolfgang for getting 
me to understand that.]

An executable (or linker "-r" output) where all input files use the
standard DWARF style can be handled exactly the same way.  The linker
will append all the .debug_str_offsets contributions together, do all
the relocations, and the net result is a new sequence of headers and
arrays.

(b) For a .dwo file, the compiler produces a .debug_str_offsets.dwo
section, which is laid out like a single "contribution" in the .o file
(that is, there is one header describing the entire section, and only
one array of offsets).  Unlike the .o file, the DWARF spec says units in
the .dwo file do *not* have a DW_AT_str_offsets_base attribute; this 
means all units in the .dwo file must share the one and only array.  The 
missing attribute implicitly points to the 0th element of that array,
and "parsing" the section means looking at exactly one header.

[I currently think that requiring a single contribution in the .dwo file
is not a bug, but a feature, because it means .debug_line.dwo can use 
.debug_str_offsets.dwo without worrying about which contribution to use.]

(c) For a .dwp file, the packaging tool (like the linker) will append
all the section contributions from the various .dwo files together.
In lieu of relocations, the packager is required to construct an "index"
table, which allows a consumer to associate a particular DWARF unit with
the .debug_str_offsets.dwo contribution from the same .dwo file.  Note 
that this index tells the reader how to find the header, and from there
it can find the 0th element, just as when it is reading a .dwo file.

That's how things work for standard DWARF.  There was also a prototype
of this done in GCC, prior to standardization, which differs slightly.
(Clang also does this, but for simplicity I'll call it the GCC style.)
It differs in that there is no header and the offsets are always 32-bit.
AFAIK the GCC style is tied to split DWARF, meaning we see this style
only in a .debug_str_offsets.dwo section and not .debug_str_offsets.
Certainly we should see the GCC style only in the context of DWARF v4,
as a v5 producer ought to be using the standard style.  Also, GCC has
defined its own "form" codes, so any references to the table from other
parts of the DWARF can be decoded unambiguously.

What this does mean is that we can't look at .debug_str_offsets.dwo in
isolation and be sure how to interpret it.  It might have a standard
header, or it might be a GCC style table with no header.  We need to
look at the version of the associated .debug_info.dwo section, or know
which form codes are used to reference the table, before we can decide
whether a given .debug_str_offsets.dwo contribution is standard or GCC
style.
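
As a sketch of that decision (not what any existing tool does), the
check could key off the form codes first and fall back to the unit
version.  The form values below are the DWARF v5 and GNU extension
encodings as I understand them; the function name and the returned
labels are invented for this example.

```python
from typing import Iterable

# Standard DWARF v5 string-index forms.
DW_FORM_strx = 0x1A
DW_FORM_strx1 = 0x25
DW_FORM_strx2 = 0x26
DW_FORM_strx3 = 0x27
DW_FORM_strx4 = 0x28
STANDARD_STRX_FORMS = {DW_FORM_strx, DW_FORM_strx1, DW_FORM_strx2,
                       DW_FORM_strx3, DW_FORM_strx4}

# Pre-standard GNU extension form.
DW_FORM_GNU_str_index = 0x1F02


def str_offsets_style(unit_version: int, forms_used: Iterable[int]) -> str:
    """Guess how the unit's .debug_str_offsets.dwo contribution is laid out.

    Returns "standard" (has a header) or "gcc" (no header, 32-bit offsets).
    """
    forms = set(forms_used)
    uses_gnu = DW_FORM_GNU_str_index in forms
    uses_std = bool(forms & STANDARD_STRX_FORMS)
    if uses_gnu and uses_std:
        raise ValueError("unit mixes GNU and standard string-index forms")
    if uses_gnu:
        return "gcc"
    if uses_std:
        return "standard"
    # No string-index forms seen: fall back on the unit version.
    return "standard" if unit_version >= 5 else "gcc"


print(str_offsets_style(4, [DW_FORM_GNU_str_index]))
print(str_offsets_style(5, [DW_FORM_strx1]))
```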

The same holds true for a .dwp file, where we do have the index to
slice up the section for us but any individual slice has the same
problem as the .dwo file it came from.

Things get way trickier in an object (executable or "-r" output) that
has a mix of GCC and standard contributions. AFAICT there's no 
equivalent of DW_AT_str_offsets_base in the GCC style, so about all 
we can do is something like this:
(1) Walk through all units to find all DW_AT_str_offsets_base pointers;
(2) for each one, poke around in the prior 8-16 bytes looking for
    the header; this is more reliable than it sounds;
(3) assume everything else in the section is GCC style.
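
Steps (2) and (3) could be sketched roughly as follows.  This is a
heuristic under stated assumptions, not a definitive implementation:
little-endian data, version 5 with zero padding in any standard header,
and standard contributions that do not overlap.  The function names are
hypothetical, and step (1) is represented by a list of base offsets that
the caller has already collected.

```python
import struct
from typing import Iterable, List, Optional, Tuple


def probe_header(data: bytes, base: int) -> Optional[Tuple[int, int, int]]:
    """Look just before `base` for a plausible standard header.

    Returns (header_start, contribution_end, entry_size), or None.
    """
    if base > len(data):
        return None
    if base >= 8:  # 32-bit DWARF header: length(4) + version(2) + padding(2)
        length, version, padding = struct.unpack_from("<IHH", data, base - 8)
        end = base - 4 + length
        if (length != 0xFFFFFFFF and length >= 4 and version == 5
                and padding == 0 and (length - 4) % 4 == 0
                and end <= len(data)):
            return (base - 8, end, 4)
    if base >= 16:  # 64-bit DWARF header: marker(4) + length(8) + 2 + 2
        marker, length, version, padding = struct.unpack_from(
            "<IQHH", data, base - 16)
        end = base - 4 + length
        if (marker == 0xFFFFFFFF and length >= 4 and version == 5
                and padding == 0 and (length - 4) % 8 == 0
                and end <= len(data)):
            return (base - 16, end, 8)
    return None


def classify_section(data: bytes,
                     bases: Iterable[int]) -> List[Tuple[int, int, str]]:
    """Split the section into (start, end, style) regions."""
    found = sorted({h for h in (probe_header(data, b) for b in set(bases))
                    if h is not None})
    regions = []
    pos = 0
    for start, end, _entry_size in found:
        if start > pos:
            regions.append((pos, start, "gcc"))  # step (3): leftover bytes
        regions.append((start, end, "standard"))
        pos = max(pos, end)
    if pos < len(data):
        regions.append((pos, len(data), "gcc"))
    return regions


# Two headerless GCC-style entries, then a standard 32-bit contribution.
mixed = struct.pack("<2I", 0, 5) + struct.pack("<IHH2I", 12, 5, 0, 10, 15)
print(classify_section(mixed, bases=[16]))
```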

At least that's what the dumper will have to do.  The debugger can
probably do it more lazily, but it's still kind of annoying.

Questions and brickbats welcome.
--paulr

P.S. Ah, you clever reader, who noticed I carefully said nothing about
LTO of mixed-DWARF-version compilations!  Haven't thought about it.