Sanitizer coverage: PC vs modules are not handled efficiently?
Kostya Serebryany
kcc at google.com
Fri Mar 20 11:59:55 PDT 2015
On Fri, Mar 20, 2015 at 11:36 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>
> On Fri, Mar 20, 2015 at 11:22 AM, Timur Iskhodzhanov <timurrrr at google.com>
> wrote:
>
>> Hi,
>> I'm working on enabling sanitizer coverage support for ASan on Windows.
>>
>> I got a PoC working, but didn't like how the code fits one aspect of the
>> architecture and wanted to discuss how to fix that.
>>
>>
>> Let's start with CoverageData::DumpOffsets().
>> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>> 733 void CoverageData::DumpOffsets() {
>> ...
>> 739 for (uptr m = 0; m < module_name_vec.size(); m++) {
>> ...
>> 744 auto r = module_name_vec[m];
>> ...
>> 748 const char *module_name = "<unknown>";
>> 749 for (uptr i = r.beg; i < r.end; i++) {
>> ...
>> 753 uptr offset = 0;
>> 754 sym->GetModuleNameAndOffsetForPC(pc, &module_name, &offset);
>> 755 offsets.push_back(BundlePcAndCounter(offset, counter));
>> 756 }
>> ...
>> 770 module_name = StripModuleName(r.name);
>>
>> Looking at line 770, I think we don't need to query
>> GetModuleNameAndOffsetForPC for the module name.
>> Looking at lines 754/755, why don't we just store the module boundaries
>> in module_name_vec along with its name?
>> That'd save us one extra GetModuleNameAndOffsetForPC call [a linear
>> lookup on the *fast* path].
>> We already know what module this PC belongs to, right?
>>
>
> Yes, that may work. (you never know before you test it).
> The code is like this because it got refactored from another state where
> the PCs from different modules were intermixed.
>
>
>>
>>
>> Now let's look at CoverageData::UpdateModuleNameVec(...).
>> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>> 333 void CoverageData::UpdateModuleNameVec(uptr caller_pc, uptr
>> range_beg,
>> 334 uptr range_end) {
>> ...
>> 338 const char *module_name = sym->GetModuleNameForPc(caller_pc);
>> 339 if (!module_name) return;
>> 340 if (module_name_vec.empty() || module_name_vec.back().name !=
>> module_name)
>> 341 module_name_vec.push_back({module_name, range_beg, range_end});
>> ...
>> 344 }
>>
>> GetModuleNameForPc(pc) returns const char* and we just store it in a
>> vector? Intriguing.
>> Module name strings compared as pointers? Even more intriguing. [see
>> "side note" below"]
>>
>> lib/sanitizer_common/sanitizer_symbolizer.h:
>> 89 const char *GetModuleNameForPc(uptr pc) {
>> 90 const char *module_name = 0;
>> 91 uptr unused;
>> 92 if (GetModuleNameAndOffsetForPC(pc, &module_name, &unused))
>> 93 return module_name;
>>
>> GetModuleNameAndOffsetForPC forwards
>> to PlatformFindModuleNameAndOffsetForAddress, which on POSIX is:
>> lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc
>> 397 bool PlatformFindModuleNameAndOffsetForAddress(uptr address,
>> 398 const char
>> **module_name,
>> 399 uptr
>> *module_offset) override {
>> 400 LoadedModule *module = FindModuleForAddress(address);
>> 401 if (module == 0)
>> 402 return false;
>> 403 *module_name = module->full_name();
>> 404 *module_offset = address - module->base_address();
>> 405 return true;
>> 406 }
>> ...
>> 368 LoadedModule *FindModuleForAddress(uptr address) {
>> 369 bool modules_were_reloaded = false;
>> 370 if (modules_ == 0 || !modules_fresh_) {
>> 371 modules_ = (LoadedModule*)(symbolizer_allocator_.Allocate(
>> 372 kMaxNumberOfModuleContexts * sizeof(LoadedModule)));
>> ...
>> 374 n_modules_ = GetListOfModules(modules_,
>> kMaxNumberOfModuleContexts,
>> 375 /* filter */ 0);
>> ...
>> 378 modules_fresh_ = true;
>> 379 modules_were_reloaded = true;
>> 380 }
>> 381 for (uptr i = 0; i < n_modules_; i++) {
>> 382 if (modules_[i].containsAddress(address)) {
>> 383 return &modules_[i];
>> 384 }
>> 385 }
>> 386 // Reload the modules and look up again, if we haven't tried it
>> yet.
>> 387 if (!modules_were_reloaded) {
>> 388 // FIXME: set modules_fresh_ from dlopen()/dlclose()
>> interceptors.
>> 389 // It's too aggressive to reload the list of modules each
>> time we fail
>> 390 // to find a module for a given address.
>> 391 modules_fresh_ = false;
>> 392 return FindModuleForAddress(address);
>> 393 }
>> 394 return 0;
>> 395 }
>>
>> We seem to leak 80...160KB every time a new module is loaded [calls
>> UpdateModuleNameVec].
>>
>
> Good catch, fix it!
>
Probably not that bad.
At least the implementation in sanitizer_symbolizer_posix_libcdep.cc
returns the same pointer for module name every time,
unless new modules were loaded. So, on linux everything should work fine.
>
>
>> Who owns the module name strings stored in module_name_vec?
>> OK, the LoadedModule strdup's a module name, so we also leak all the
>> module names when a new module is loaded.
>> I'm not sure linear search is the right thing to do, though I'm OK with
>> it if this function does not show in the profile.
>>
>> Side note: It looks possible that dlopen+dlclosing a module would cause
>> the same module to have two equal module names at different addresses,
>> hence it will appear twice in module_name_vec -- is it a bug or a feature?
>>
>
> Hm. Maybe yes, maybe not. We don't support dlclose particularly well.
>
>> Re: bug or feature question -- what's the expected result dump if we
load/unload/load the same library? I'd expect the counters/bits to be
merged
It would be nice to merge them, but at least if the library is loaded at
different address we won't do that.
Again, we don't support dlclose very well, and thi sis not the only trouble
with it.
>
>
>>
>>
>> Let's now consider that I have to
>> re-implement PlatformFindModuleNameAndOffsetForAddress for Windows.
>> I don't think I want to implement it the same way it's done for POSIX, as
>> I find that implementation suboptimal.
>>
>> It seems the usage pattern of PlatformFindModuleNameAndOffsetForAddress
>> is to query batches of PCs from one module before going to the next one.
>> Assuming I can observe FreeLibrary [or dlclose], I can just cache the
>> module info [name+boundaries] for the last lookup without the need to
>> re-query the list of modules.
>> I think the same can be done on POSIX?
>> Maybe we should use GetModuleByPC(PC), store the result in
>> module_for_the_last_pc_ and check if a new PC is there?
>>
>> I think we should also change the way we own module names.
>> Can we store them in a string-set (do we have one?)
>>
>
> we don't, but we can use an array :)
> (sorted or not, depends on how often we access vs modify it)
>
>
>> to make sure they are unique'd and pointer comparison is OK?
>> Currently there's no owner for module names on Windows, as we haven't
>> needed to store LoadedModules before.
>> Querying most of the things was as simple as
>> char xyz[...];
>> GetWhatINeed(xyz, sizeof(xyz));
>> so the "owner" of the string was stack.
>> Doing a strdup on every call is impractical :)
>>
>>
>> Ideas?
>>
>
> Fix all of these, one by one!
>
>
>> --
>> Timur
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150320/af4fed1d/attachment.html>
More information about the llvm-commits
mailing list