[lldb-dev] New DWARF parser proposal

Fri Aug 30 16:08:54 PDT 2013

At Apple, when we have our own accelerator tables and we partially parse everything. We know our accelerator tables are good and actually useful. The DWARF tables like ".debug_pubnames" only shows public names which makes it useless. DWARF ".debug_aranges" is often incomplete or possibly missing like you said below, and DWARF "accelerator" tables (yes I put the quotes on there for a reason), aren't really great because they force you to parse them and the data contained within then is randomly ordered, so to make them useful, you must parse them and sort them so they are useable. So we at Apple made accelerator tables that actually work. The format of these accelerator tables is well defined and available as a specification in the LLVM/Clang sources. They work well and stop us from having to manually index (like all targets except Darwin) the DWARF to make sure we have the truth.

I want to clarify a few things from what was stated... When we parse DIEs, make an DIE array per compile unit, though each DIE is very small: 4 uint32_t objects per DIE. I have a way to reduce this even further to just 2 uint32_t objects. So we are only storing the DIE offsets (32 bits each) and any data necessary to be able to find the first child, sibling, or parent DIE offset using the current DIE. So this data isn't all that huge. This data then can be used to navigate the DWARF. 

On Aug 30, 2013, at 1:27 PM, Richard Mitton <richard at codersnotes.com> wrote:

> DWARF parsing is currently very slow. On my machine, loading the 'clang' binary into lldb takes 14 seconds (vs. gdb's 29 seconds). The actual I/O cost of reading that much data is only around 2 seconds.
> 
> The DWARF parser has to read the entire .debug_info section into memory and parse it. It would be great if we didn't have to do this. OS X has the .apple_names section et al, which allow lldb to automatically have an index without having to parse anything.
> 
> However, this is an Apple extension and does not exist on other platforms. There are a bunch of accelerator tables that the DWARF spec allows for, but they're all unusable. pubnames is either absent or incomplete, aranges might not be present, and if they were generated by the compiler rather than the linker, then they may not cover all the object files in the binary (causing lldb to miss symbols). So we end up in the situation we have now, where we cannot use or trust the accelerators, and have to parse everything anyway to build our own index.
> 
> I believe lldb does this the wrong way. Performing just a simple "b main" test, it will touch the entire debug_info section *4 times* (at least), which on the clang binary example is 600MB of data each time:
> 
> - pass 1 (extract DIEs)
> - pass 2 (index DIEs)
> - pass 3 (extract DIEs)
> - pass 4 (build arange tables)
> 
> I believe the key problems of the current design are:
>    1) lldb tries to build it's own DIE array copy, rather than just referring to the existing data in-place. This adds a significant management overhead to all functions accessing this data.

We do refer to the data in place, our DIE is actually just the DIE offset of the DIE itself within the .debug_info section (32 bits), parent and sibling indexes (32 bits each), and some state that is stored in each DIE because the compiler will pad the DIE anyway. So the total for each DIE is 128 bits. We do not store _anything_ else. When we need to access the data in each DIE, we go to the DIE offset of the DIE and starting parsing the information we need, then we throw that information away.

>    2) lldb goes to great efforts to avoid reading the entire debug information (even though it will ultimately need to anyway) and to avoid keeping it in memory. This in fact causes it to *reload* it several times, as each further operation performs lazy initialization and causes a re-parse.

This can be easily fixed by making sure when we scan it the first time that we generate any needed accelerator tables at that time to avoid any re-parsing. Clang and its 600MB is nothing compared to debugging WebKit with 1.8GB of debug info, or debugging Xcode with 200 shared libraries that all have debug info. So we need to free up this memory, or LLDB ends up taking up too much memory when debugging applications with lots of debug info.

> 
> If we just accepted that we are forced to load all the data once, it would actually be faster. My suggestion therefore is to write an optimized single-pass DWARF indexer to replace the current DWARF loader, with the following properties:
> 
> - Always read debug info exactly once upon module load (unless we can guarantee apple extensions are used).

That sounds nice in theory, but when you parse the .debug_info, what would you store in order to be able to navigate the DWARF (parent, child, sibling)? At least you will need to store the DIE offset of each DIE in an array. That is at least 32 bits. I have a solution that can get the DWARFDebugInfoEntry down to 2 32 bit values: the DIE offset, and the depth. Then each DIE is only 64 bits. I haven't ported this into LLDB yet, but I can do this. 

> - Use the entire debug_info section in-place, without trying to build a copy. 

We do use the entire .debug_info section in place. We need the DIE offsets at the very least for all DIEs in a compile unit. If you think we don't, I would like to hear how you plan to do this.

> Not having separate stages for extraction and indexing will allow efficient data traversal.

This can be fixed in the current situation be determining if we need to generate the accelerator tables and making sure we only traverse the DWARF once. On MacOSX, with the accelerator tables, we don't need to traverse any DWARF, except for the compile units we find through our accelerator table lookups.

> - Make use of the abbreviation tables to pre-build decoder structures for each DWARF tag type. In most cases we can know the size of each DIE the moment we read it's abbreviation code in, and can skip in one operation if needed without having to parse the elements. Because we run in one pass, we never have to even look at DIEs we don't need.

I added this feature a while back thinking it was going to change the world and make parsing much much faster and it didn't on MacOSX. So I never checked it in. This is a good thing to do though as about 60-70% of DIEs have fixed sizes and it might make non MacOSX parsing faster when you do need to parse everything.

> - Track the parent scope as we go, on a stack, so we don't have to keep doing lookups which walk the DIE tree. The current parser walks up the tree to find what scope it's in, even though it already parsed the parent scope container.

Walking the tree is done all over the place, so any solution needs to be able to do the same thing. Walking the tree is not expensive and is very fast right now as each die knows it is store in an std:vector<> and it can calculate the parent very efficiently by returning "this - m_parent_idx", and its sibling by doing "this + m_sibling_idx" (there is a check to make sure the indexes are valid).

I don't see how you would make navigating DWARF efficient by always just reading from the .debug_info section with nothing else being stored.

> - Build arange tables automatically as we go, ignoring any that might already be present. We have already touched and extracted the range data anyway, it would be trivial to build an accelerator table for free.

That can just be combined with the initial parsing, but only if needed.

> - For strings, we should pre-pool the DWARF string table once up-front, to avoid repeatedly pooling strings for each DIE.

This can be done, but we often don't make all DWARF strings into lldb_private::ConstString values unless we actually parse the information. This will definitely bloat memory, but we would need to see by how much before we can say it works. Then we need to make a map that maps 32 bit string offsets to ConstString values. I am sure we can use llvm::DenseMap<>.
> 
> With this approach we use the DIE data as-is in memory, without having to make our own copy.

Again, we don't make a copy. What are we making a copy of? Each DIE is 4 32 bit values, I can reduce it to 2 32 bit values. I worry navigation will be slow if we don't store anything for DIEs, but I would be happy to see your implementation.

> Parent chains should ideally only be used during parsing. If parents/siblings are really needed after the initial parse, one easy solution would be to just store that in a separate hash table.

If you just store the following for each DIE:

class DWARFDebugInfoEntry {
    uint32_t m_offset;
    uint32_t m_depth;
}

Then you can reduce memory usage at the expense of having navigations (get parent/child/sibling) be slower.

> I welcome discussion on this. I think it's important for lldb to not have any delays on loading programs, and as we cannot control what the compilers will supply to us, we have to address this on our end.

I think a good start is to just parse the accelerator tables during the initial parsing and making sure we only touch the .debug_info once for indexing for name and address, throw it all away (save memory) and only partially parse what we need to when we need to after that, this would get us some performance benefit.

Any solution _must_ be able to do the following:
- Not hog memory
- Be able to quickly navigate parent/child/sibling
- Quickly find any DIE by DIE offset
- partially parse

The solution we came up with now relies heavily on the Apple DWARF accelerator tables to get its performance. Back before we have these tables we did index everything and we ran extensive memory performance tools on LLDB to stop it from using too much memory. 

I am all for improving performance, but we don't need all DWARF parsed up front. The Apple accelerator tables work great, and they can easily be enabled non-darwin copies of clang, and in other compilers. The Apple accelerator tables are being proposed to become the standard DWARF accelerator tables for DWARF 5. They are up agains the GDB index tables which to us are not that useful as the table data says "foo" is in compile unit 0x12232000. They don't specify the actual DIE within a compile unit...

So wrapping things up
1 - we should fix the multiple .debug_info pass issue when we need to generate the index tables as this is lame. We need to parse the accelerator tables, only if they need to be generated, when doing the initial parse. This initial parse doesn't need to happen if we have the Apple accelerator tables. We also need to parse all needed stuff for address and names in this initial pass, again only if needed
2 - Feel free to add code to the DWARFAbbreviationDeclaration() ask it for its byte size, and if it returns non-zero, then it we can just skip the entire DIE data when doing the initial parse, unless of course we need to index. If it returns zero, the DIE will need to be parsed one attribute/form at a time.