[lldb-dev] Handling of the ELF files missing build-ids?

Pavel Labath via lldb-dev lldb-dev at lists.llvm.org
Tue Aug 7 02:16:35 PDT 2018


Sounds like a plan.

On Tue, 7 Aug 2018 at 01:30, Leonard Mosescu <mosescu at google.com> wrote:
>>
>> I am fine with all the above except some reservations about case C. No need to calculate something if it isn't useful. For case C it should be fine to never match as if a file has a UUID to begin with it typically isn't something that gets stripped in a stripped binary. So we should either have it or not. If breakpad does calculate a CRC32, then we need to know to ignore the UUID. The problem is we probably won't be able to tell what the UUID is: real from build ID, or from GNU debug info CRC, or CRC of entire file. So the minidump code will need to do something here. If a minidump has the linux auxv and memory map in them, then we might need to dig through the section information and deduce if a file matches or not based off the size of mapped program headers to further help with the matching.
>>
>> One other idea is to make a set of enumerations for the UUID type:
>>
>> class UUID {
>>   enum class Type {
>>     BuildID,         // A build ID from the compiler or linker
>>     GNUDebugInfoCRC, // GNU debug info CRC
>>     MD5,             // MD5 of entire file
>>     MD5NonDebug,     // MD5 of the non debug info related bits
>>     CRC32,           // CRC32 of entire file
>>     Other,           // Anything else
>>   };
>> };
>>
>> The eTypeMD5NonDebug is what apple does: it MD5 checksums only the parts of the file that don't change with debug info or any paths found in debug info or symbols tables. So if you build a binary in /tmp/a or in /private/var/local/foo, the UUID is the same if the binary is essentially the same (code, data, etc).
>>
>> Then we can make intelligent comparisons between UUID types. Might even be possible for a module to have more than 1 UUID then if a binary contains a eTypeBuildID and a eTypeGNUDebugInfoCRC. If a tool stores its UUIDs as a CRC32 or MD5, then those can be calculated on the fly. The GetUUID on lldb_private::Module might become:
>>
>> const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type);
>>
>> Thoughts?
>>
>> Greg
>
>
> I like the idea of UUID sub-types! This solves the problem is a more generic fashion and it's also extensible. Interestingly enough, for Crashpad we're considering something similar (the fabricated UUIDs would have a different CvRecord signature)
>
> Case C. is a bit ugly so let me elaborate: this is specific to Breakpad minidump + ELF binaries w/o build-id. So we'd still need to have a way to force the match of the modules in the minidump with the local files. This ought to be an off-by-default, sharp tool which you'd only need to deal with Breakpad minidumps, and even then it would only be a fall-back that must be explicitly requested.
>
>> 1. .gnu_debuglink separate file pointer
>> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>.
>> This is where the choice of the crc algorithm comes from.
>
>
> Thanks Pavel. As you noted yourself, this doesn't mean that the UUID has to be tied to the checksum (they are exclusive options). In particular, for performance reasons I think it's desirable to avoid calculating checksums for every ELF module just in case.
>
>> I think we might have something already which could serve this
>> purpose. Eugene added a couple of months ago a mechanism to force-load
>> symbols for a given file regardless of the UUIDs
>> <https://reviews.llvm.org/D35607>. It requires an explicit "target
>> symbols add" command (which seems reasonable, as lldb has no way to
>> tell if it's doing things right). Would something like that work for
>> you?
>
>
> Nice. We may have to update the C++ API, but something like this would do the trick for case C.
>
> To summarize the conversation so far:
>
> 1. We can fix cases A, B independent of C: if real UUIDs are missing, don't automatically use full file CRC32 as UUID.
> 2. Pay attention to make sure that we don't break .gnu_debuglink or remote debugging (thanks Pavel)
> 3. Multiple types/namespaces for UUIDs would be a great feature!
> 4. Eugene's symbol forcing trick could be extended to handle case C
>
> Did I miss anything?
>
> My current plan is to start with #1, then look into #4 (force symbols match).
>
>
> On Sun, Aug 5, 2018 at 12:11 PM, Pavel Labath <labath at google.com> wrote:
>>
>> Hello Leonard,
>>
>> while I'm in principle supportive of this idea, I think it's not going
>> to be as easy as you might imagine. There are currently at least two
>> mechanisms which rely on this crc UUID.
>>
>> 1. .gnu_debuglink separate file pointer
>> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>.
>> This is where the choice of the crc algorithm comes from.
>>
>> In short, this mechanism for debug info location works like this: The
>> stripped file contains a .gnu_debuglink section.  The section contains
>> a file path and a crc checksum. After reading this section the
>> debugger is expected to look for the file at the given path, and then
>> compute it's checksum to verify it is indeed the correct file (hasn't
>> been modified).
>>
>> In LLDB, this is implemented somewhat differently. First we have a
>> mechanism for assigning UUIDs to modules. This mechanism does one of
>> two things (I'm assuming here none of the files have proper build-ids
>> in them). If the file has a .gnu_debuglink section (the stripped
>> file), we fetch the CRC from there, and assign that as the build-id of
>> the file. If the file doesn't have this section, we **assume** it is
>> going to be a target of gnu_debuglink pointer in another file, compute
>> its CRC via the specified algorithm, and assign that as the build-id.
>>
>> Second, we have a mechanism for locating the unstripped file. This
>> just fetches the path from gnu_debuglink, and compares the UUIDs of
>> the two modules. This works because of the UUID algorithm in the first
>> part.
>>
>> Now, this is not a particularly smart way of doing things and it is
>> the cause of the most of your problems. However, it is completely
>> avoidable. Instead of piggy-backing on the UUID mechanism, we should
>> just change the search mechanism (the second part) to compute the crc
>> itself, and only after it successfully finds the file referenced by
>> the gnu_debuglink section. This way, we will only compute the crc only
>> when absolutely necessary (i.e. for your use case, never).
>>
>>
>> 2. Currently, for remote debugging, we assume that each module has
>> some sort of a unique identifier which we can check to see whether we
>> have downloaded that file already (see qModuleInfo packet). I am not
>> sure what would happen if we suddenly just stopped computing a UUID
>> for these files, but we at the very least lose the ability to detect
>> changes to the remote files. For this item, I am not sure what would
>> be the best course of action. Maybe we should just start relying on
>> modification timestamp for these files?
>>
>> On Sat, 4 Aug 2018 at 02:17, Leonard Mosescu <mosescu at google.com> wrote:
>> >
>> > Greg, Mark,
>> >
>> > Looking at the code, LLDB falls back to a full file crc32 to create the module UUID if the ELF build-id is missing. This works, in the sense that the generated UUID does indeed identify the module.
>> >
>> > But there are a few problems with this approach:
>> >
>> > 1. First, runtime performance: a full file crc32 is a terribly inefficient way to generate a temporary UUID that is basically just used to match a local file to itself.
>> > - especially when some unstripped binaries can be very large. for example a local chromium build produces a 5.3Gb chrome binary
>> > - the crc32 implementation is decent, but single-threaded
>> > - to add insult to the injury, it seems a small bug defeats the intention to cache the hash value so it ends up being recalculated multiple times
>> >
>> > 2. The fake UUID is not going to match any external UUID that may be floating around (and yet not properly embedded into the binary)
>> > - an example is Breakpad, which unfortunately also attempts to make up UUIDs when the build-id is missing (something we'll hopefully fix soon)
>> >
>> > Is there a fundamental reason to calculate the full file crc32? If not I propose to improve this based on the following observations:
>> >
>> > A. Model the reality more accurately: an ELF w/o a build-id doesn't really have an UUID. So use a zero-length UUID in LLDB.
>> > B. The full file name should be enough to prove the identity of a local module.
>> > C. If we try to match an external UUID (ex. from a minidump) with a local file which does not have an UUID it may help to have an option to allow it to match (off by default, and only if there's no better match)
>>
>> I think we might have something already which could serve this
>> purpose. Eugene added a couple of months ago a mechanism to force-load
>> symbols for a given file regardless of the UUIDs
>> <https://reviews.llvm.org/D35607>. It requires an explicit "target
>> symbols add" command (which seems reasonable, as lldb has no way to
>> tell if it's doing things right). Would something like that work for
>> you?
>>
>> cheers,
>> pl
>
>


More information about the lldb-dev mailing list