Sanitizer coverage: PC vs modules are not handled efficiently?

Kostya Serebryany kcc at google.com
Fri Mar 20 11:36:36 PDT 2015


On Fri, Mar 20, 2015 at 11:22 AM, Timur Iskhodzhanov <timurrrr at google.com>
wrote:

> Hi,
> I'm working on enabling sanitizer coverage support for ASan on Windows.
>
> I got a PoC working, but didn't like how the code fits one aspect of the
> architecture and wanted to discuss how to fix that.
>
>
> Let's start with CoverageData::DumpOffsets().
> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>   733 void CoverageData::DumpOffsets() {
>   ...
>   739   for (uptr m = 0; m < module_name_vec.size(); m++) {
>   ...
>   744     auto r = module_name_vec[m];
>   ...
>   748     const char *module_name = "<unknown>";
>   749     for (uptr i = r.beg; i < r.end; i++) {
>   ...
>   753       uptr offset = 0;
>   754       sym->GetModuleNameAndOffsetForPC(pc, &module_name, &offset);
>   755       offsets.push_back(BundlePcAndCounter(offset, counter));
>   756     }
>   ...
>   770     module_name = StripModuleName(r.name);
>
> Looking at line 770, I think we don't need to query
> GetModuleNameAndOffsetForPC for the module name.
> Looking at lines 754/755, why don't we just store the module boundaries in
> module_name_vec along with its name?
> That'd save us one extra GetModuleNameAndOffsetForPC call [a linear lookup
> on the *fast* path].
> We already know what module this PC belongs to, right?
>

Yes, that may work. (you never know before you test it).
The code is like this because it got refactored from another state where
the PCs from different modules were intermixed.


>
>
> Now let's look at CoverageData::UpdateModuleNameVec(...).
> lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
>   333 void CoverageData::UpdateModuleNameVec(uptr caller_pc, uptr
> range_beg,
>   334                                        uptr range_end) {
>   ...
>   338   const char *module_name = sym->GetModuleNameForPc(caller_pc);
>   339   if (!module_name) return;
>   340   if (module_name_vec.empty() || module_name_vec.back().name !=
> module_name)
>   341     module_name_vec.push_back({module_name, range_beg, range_end});
>   ...
>   344 }
>
> GetModuleNameForPc(pc) returns const char* and we just store it in a
> vector?  Intriguing.
> Module name strings compared as pointers?  Even more intriguing.  [see
> "side note" below"]
>
> lib/sanitizer_common/sanitizer_symbolizer.h:
>  89   const char *GetModuleNameForPc(uptr pc) {
>  90     const char *module_name = 0;
>  91     uptr unused;
>  92     if (GetModuleNameAndOffsetForPC(pc, &module_name, &unused))
>  93       return module_name;
>
> GetModuleNameAndOffsetForPC forwards
> to PlatformFindModuleNameAndOffsetForAddress, which on POSIX is:
> lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc
>   397   bool PlatformFindModuleNameAndOffsetForAddress(uptr address,
>   398                                                  const char
> **module_name,
>   399                                                  uptr
> *module_offset) override {
>   400     LoadedModule *module = FindModuleForAddress(address);
>   401     if (module == 0)
>   402       return false;
>   403     *module_name = module->full_name();
>   404     *module_offset = address - module->base_address();
>   405     return true;
>   406   }
> ...
>   368   LoadedModule *FindModuleForAddress(uptr address) {
>   369     bool modules_were_reloaded = false;
>   370     if (modules_ == 0 || !modules_fresh_) {
>   371       modules_ = (LoadedModule*)(symbolizer_allocator_.Allocate(
>   372           kMaxNumberOfModuleContexts * sizeof(LoadedModule)));
> ...
>   374       n_modules_ = GetListOfModules(modules_,
> kMaxNumberOfModuleContexts,
>   375                                     /* filter */ 0);
> ...
>   378       modules_fresh_ = true;
>   379       modules_were_reloaded = true;
>   380     }
>   381     for (uptr i = 0; i < n_modules_; i++) {
>   382       if (modules_[i].containsAddress(address)) {
>   383         return &modules_[i];
>   384       }
>   385     }
>   386     // Reload the modules and look up again, if we haven't tried it
> yet.
>   387     if (!modules_were_reloaded) {
>   388       // FIXME: set modules_fresh_ from dlopen()/dlclose()
> interceptors.
>   389       // It's too aggressive to reload the list of modules each time
> we fail
>   390       // to find a module for a given address.
>   391       modules_fresh_ = false;
>   392       return FindModuleForAddress(address);
>   393     }
>   394     return 0;
>   395   }
>
> We seem to leak 80...160KB every time a new module is loaded [calls
> UpdateModuleNameVec].
>

Good catch, fix it!


> Who owns the module name strings stored in module_name_vec?
> OK, the LoadedModule strdup's a module name, so we also leak all the
> module names when a new module is loaded.
> I'm not sure linear search is the right thing to do, though I'm OK with it
> if this function does not show in the profile.
>
> Side note: It looks possible that dlopen+dlclosing a module would cause
> the same module to have two equal module names at different addresses,
> hence it will appear twice in module_name_vec -- is it a bug or a feature?
>

Hm. Maybe yes, maybe not. We don't support dlclose particularly well.


>
>
> Let's now consider that I have to
> re-implement PlatformFindModuleNameAndOffsetForAddress for Windows.
> I don't think I want to implement it the same way it's done for POSIX, as
> I find that implementation suboptimal.
>
> It seems the usage pattern of PlatformFindModuleNameAndOffsetForAddress is
> to query batches of PCs from one module before going to the next one.
> Assuming I can observe FreeLibrary [or dlclose], I can just cache the
> module info [name+boundaries] for the last lookup without the need to
> re-query the list of modules.
> I think the same can be done on POSIX?
> Maybe we should use GetModuleByPC(PC), store the result in
> module_for_the_last_pc_ and check if a new PC is there?
>
> I think we should also change the way we own module names.
> Can we store them in a string-set (do we have one?)
>

we don't, but we can use an array :)
(sorted or not, depends on how often we access vs modify it)


> to make sure they are unique'd and pointer comparison is OK?
> Currently there's no owner for module names on Windows, as we haven't
> needed to store LoadedModules before.
> Querying most of the things was as simple as
>   char xyz[...];
>   GetWhatINeed(xyz, sizeof(xyz));
> so the "owner" of the string was stack.
> Doing a strdup on every call is impractical :)
>
>
> Ideas?
>

Fix all of these, one by one!


> --
> Timur
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150320/49cdd5db/attachment.html>


More information about the llvm-commits mailing list