Sanitizer coverage: PC vs modules are not handled efficiently?
Timur Iskhodzhanov
timurrrr at google.com
Fri Mar 20 11:48:41 PDT 2015
Re: bug or feature question -- what's the expected result dump if we
load/unload/load the same library? I'd expect the counters/bits to be
merged
пт, 20 марта 2015, 21:36, Kostya Serebryany <kcc at google.com>:
> On Fri, Mar 20, 2015 at 11:22 AM, Timur Iskhodzhanov <timurrrr at google.com>
> wrote:
>
>> Hi,
>> I'm working on enabling sanitizer coverage support for ASan on Windows.
>>
>> I got a PoC working, but didn't like how the code fits one aspect of the
>> architecture and wanted to discuss how to fix that.
>>
>>
>> Let's start with CoverageData::DumpOffsets().
>> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>> 733 void CoverageData::DumpOffsets() {
>> ...
>> 739 for (uptr m = 0; m < module_name_vec.size(); m++) {
>> ...
>> 744 auto r = module_name_vec[m];
>> ...
>> 748 const char *module_name = "<unknown>";
>> 749 for (uptr i = r.beg; i < r.end; i++) {
>> ...
>> 753 uptr offset = 0;
>> 754 sym->GetModuleNameAndOffsetForPC(pc, &module_name, &offset);
>> 755 offsets.push_back(BundlePcAndCounter(offset, counter));
>> 756 }
>> ...
>> 770 module_name = StripModuleName(r.name);
>>
>> Looking at line 770, I think we don't need to query
>> GetModuleNameAndOffsetForPC for the module name.
>> Looking at lines 754/755, why don't we just store the module boundaries
>> in module_name_vec along with its name?
>> That'd save us one extra GetModuleNameAndOffsetForPC call [a linear
>> lookup on the *fast* path].
>> We already know what module this PC belongs to, right?
>>
>
> Yes, that may work. (you never know before you test it).
> The code is like this because it got refactored from another state where
> the PCs from different modules were intermixed.
>
>
>>
>>
>> Now let's look at CoverageData::UpdateModuleNameVec(...).
>> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>> 333 void CoverageData::UpdateModuleNameVec(uptr caller_pc, uptr
>> range_beg,
>> 334 uptr range_end) {
>> ...
>> 338 const char *module_name = sym->GetModuleNameForPc(caller_pc);
>> 339 if (!module_name) return;
>> 340 if (module_name_vec.empty() || module_name_vec.back().name !=
>> module_name)
>> 341 module_name_vec.push_back({module_name, range_beg, range_end});
>> ...
>> 344 }
>>
>> GetModuleNameForPc(pc) returns const char* and we just store it in a
>> vector? Intriguing.
>> Module name strings compared as pointers? Even more intriguing. [see
>> "side note" below"]
>>
>> lib/sanitizer_common/sanitizer_symbolizer.h:
>> 89 const char *GetModuleNameForPc(uptr pc) {
>> 90 const char *module_name = 0;
>> 91 uptr unused;
>> 92 if (GetModuleNameAndOffsetForPC(pc, &module_name, &unused))
>> 93 return module_name;
>>
>> GetModuleNameAndOffsetForPC forwards
>> to PlatformFindModuleNameAndOffsetForAddress, which on POSIX is:
>> lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc
>> 397 bool PlatformFindModuleNameAndOffsetForAddress(uptr address,
>> 398 const char
>> **module_name,
>> 399 uptr
>> *module_offset) override {
>> 400 LoadedModule *module = FindModuleForAddress(address);
>> 401 if (module == 0)
>> 402 return false;
>> 403 *module_name = module->full_name();
>> 404 *module_offset = address - module->base_address();
>> 405 return true;
>> 406 }
>> ...
>> 368 LoadedModule *FindModuleForAddress(uptr address) {
>> 369 bool modules_were_reloaded = false;
>> 370 if (modules_ == 0 || !modules_fresh_) {
>> 371 modules_ = (LoadedModule*)(symbolizer_allocator_.Allocate(
>> 372 kMaxNumberOfModuleContexts * sizeof(LoadedModule)));
>> ...
>> 374 n_modules_ = GetListOfModules(modules_,
>> kMaxNumberOfModuleContexts,
>> 375 /* filter */ 0);
>> ...
>> 378 modules_fresh_ = true;
>> 379 modules_were_reloaded = true;
>> 380 }
>> 381 for (uptr i = 0; i < n_modules_; i++) {
>> 382 if (modules_[i].containsAddress(address)) {
>> 383 return &modules_[i];
>> 384 }
>> 385 }
>> 386 // Reload the modules and look up again, if we haven't tried it
>> yet.
>> 387 if (!modules_were_reloaded) {
>> 388 // FIXME: set modules_fresh_ from dlopen()/dlclose()
>> interceptors.
>> 389 // It's too aggressive to reload the list of modules each
>> time we fail
>> 390 // to find a module for a given address.
>> 391 modules_fresh_ = false;
>> 392 return FindModuleForAddress(address);
>> 393 }
>> 394 return 0;
>> 395 }
>>
>> We seem to leak 80...160KB every time a new module is loaded [calls
>> UpdateModuleNameVec].
>>
>
> Good catch, fix it!
>
>
>> Who owns the module name strings stored in module_name_vec?
>> OK, the LoadedModule strdup's a module name, so we also leak all the
>> module names when a new module is loaded.
>> I'm not sure linear search is the right thing to do, though I'm OK with
>> it if this function does not show in the profile.
>>
>> Side note: It looks possible that dlopen+dlclosing a module would cause
>> the same module to have two equal module names at different addresses,
>> hence it will appear twice in module_name_vec -- is it a bug or a feature?
>>
>
> Hm. Maybe yes, maybe not. We don't support dlclose particularly well.
>
>
>>
>>
>> Let's now consider that I have to
>> re-implement PlatformFindModuleNameAndOffsetForAddress for Windows.
>> I don't think I want to implement it the same way it's done for POSIX, as
>> I find that implementation suboptimal.
>>
>> It seems the usage pattern of PlatformFindModuleNameAndOffsetForAddress
>> is to query batches of PCs from one module before going to the next one.
>> Assuming I can observe FreeLibrary [or dlclose], I can just cache the
>> module info [name+boundaries] for the last lookup without the need to
>> re-query the list of modules.
>> I think the same can be done on POSIX?
>> Maybe we should use GetModuleByPC(PC), store the result in
>> module_for_the_last_pc_ and check if a new PC is there?
>>
>> I think we should also change the way we own module names.
>> Can we store them in a string-set (do we have one?)
>>
>
> we don't, but we can use an array :)
> (sorted or not, depends on how often we access vs modify it)
>
>
>> to make sure they are unique'd and pointer comparison is OK?
>> Currently there's no owner for module names on Windows, as we haven't
>> needed to store LoadedModules before.
>> Querying most of the things was as simple as
>> char xyz[...];
>> GetWhatINeed(xyz, sizeof(xyz));
>> so the "owner" of the string was stack.
>> Doing a strdup on every call is impractical :)
>>
>>
>> Ideas?
>>
>
> Fix all of these, one by one!
>
>
>> --
>> Timur
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150320/94e90ba4/attachment.html>
More information about the llvm-commits
mailing list