Sanitizer coverage: PC vs modules are not handled efficiently?

Fri Mar 20 11:22:52 PDT 2015

Hi,
I'm working on enabling sanitizer coverage support for ASan on Windows.

I got a PoC working, but didn't like how the code fits one aspect of the
architecture and wanted to discuss how to fix that.

Let's start with CoverageData::DumpOffsets().
lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
  733 void CoverageData::DumpOffsets() {
  ...
  739   for (uptr m = 0; m < module_name_vec.size(); m++) {
  ...
  744     auto r = module_name_vec[m];
  ...
  748     const char *module_name = "<unknown>";
  749     for (uptr i = r.beg; i < r.end; i++) {
  ...
  753       uptr offset = 0;
  754       sym->GetModuleNameAndOffsetForPC(pc, &module_name, &offset);
  755       offsets.push_back(BundlePcAndCounter(offset, counter));
  756     }
  ...
  770     module_name = StripModuleName(r.name);

Looking at line 770, I think we don't need to query
GetModuleNameAndOffsetForPC for the module name.
Looking at lines 754/755, why don't we just store the module boundaries in
module_name_vec along with its name?
That'd save us one extra GetModuleNameAndOffsetForPC call [a linear lookup
on the *fast* path].
We already know what module this PC belongs to, right?

Now let's look at CoverageData::UpdateModuleNameVec(...).
lib/sanitizer_common/sanitizer_coverage_libcdep.cc:
  333 void CoverageData::UpdateModuleNameVec(uptr caller_pc, uptr range_beg,
  334                                        uptr range_end) {
  ...
  338   const char *module_name = sym->GetModuleNameForPc(caller_pc);
  339   if (!module_name) return;
  340   if (module_name_vec.empty() || module_name_vec.back().name !=
module_name)
  341     module_name_vec.push_back({module_name, range_beg, range_end});
  ...
  344 }

GetModuleNameForPc(pc) returns const char* and we just store it in a
vector?  Intriguing.
Module name strings compared as pointers?  Even more intriguing.  [see
"side note" below"]

lib/sanitizer_common/sanitizer_symbolizer.h:
 89   const char *GetModuleNameForPc(uptr pc) {
 90     const char *module_name = 0;
 91     uptr unused;
 92     if (GetModuleNameAndOffsetForPC(pc, &module_name, &unused))
 93       return module_name;

GetModuleNameAndOffsetForPC forwards
to PlatformFindModuleNameAndOffsetForAddress, which on POSIX is:
lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc
  397   bool PlatformFindModuleNameAndOffsetForAddress(uptr address,
  398                                                  const char
**module_name,
  399                                                  uptr *module_offset)
override {
  400     LoadedModule *module = FindModuleForAddress(address);
  401     if (module == 0)
  402       return false;
  403     *module_name = module->full_name();
  404     *module_offset = address - module->base_address();
  405     return true;
  406   }
...
  368   LoadedModule *FindModuleForAddress(uptr address) {
  369     bool modules_were_reloaded = false;
  370     if (modules_ == 0 || !modules_fresh_) {
  371       modules_ = (LoadedModule*)(symbolizer_allocator_.Allocate(
  372           kMaxNumberOfModuleContexts * sizeof(LoadedModule)));
...
  374       n_modules_ = GetListOfModules(modules_,
kMaxNumberOfModuleContexts,
  375                                     /* filter */ 0);
...
  378       modules_fresh_ = true;
  379       modules_were_reloaded = true;
  380     }
  381     for (uptr i = 0; i < n_modules_; i++) {
  382       if (modules_[i].containsAddress(address)) {
  383         return &modules_[i];
  384       }
  385     }
  386     // Reload the modules and look up again, if we haven't tried it
yet.
  387     if (!modules_were_reloaded) {
  388       // FIXME: set modules_fresh_ from dlopen()/dlclose()
interceptors.
  389       // It's too aggressive to reload the list of modules each time
we fail
  390       // to find a module for a given address.
  391       modules_fresh_ = false;
  392       return FindModuleForAddress(address);
  393     }
  394     return 0;
  395   }

We seem to leak 80...160KB every time a new module is loaded [calls
UpdateModuleNameVec].
Who owns the module name strings stored in module_name_vec?
OK, the LoadedModule strdup's a module name, so we also leak all the module
names when a new module is loaded.
I'm not sure linear search is the right thing to do, though I'm OK with it
if this function does not show in the profile.

Side note: It looks possible that dlopen+dlclosing a module would cause the
same module to have two equal module names at different addresses, hence it
will appear twice in module_name_vec -- is it a bug or a feature?

Let's now consider that I have to
re-implement PlatformFindModuleNameAndOffsetForAddress for Windows.
I don't think I want to implement it the same way it's done for POSIX, as I
find that implementation suboptimal.

It seems the usage pattern of PlatformFindModuleNameAndOffsetForAddress is
to query batches of PCs from one module before going to the next one.
Assuming I can observe FreeLibrary [or dlclose], I can just cache the
module info [name+boundaries] for the last lookup without the need to
re-query the list of modules.
I think the same can be done on POSIX?
Maybe we should use GetModuleByPC(PC), store the result in
module_for_the_last_pc_ and check if a new PC is there?

I think we should also change the way we own module names.
Can we store them in a string-set (do we have one?) to make sure they are
unique'd and pointer comparison is OK?
Currently there's no owner for module names on Windows, as we haven't
needed to store LoadedModules before.
Querying most of the things was as simple as
  char xyz[...];
  GetWhatINeed(xyz, sizeof(xyz));
so the "owner" of the string was stack.
Doing a strdup on every call is impractical :)

Ideas?
--
Timur
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150320/4ac7ec41/attachment.html>