[PATCH] D65633: [Object] Create MutableELFObject Class for Doing Mutations on ELFObjectFiles [Part 3]

Mon Aug 19 18:34:38 PDT 2019

abrachet marked 13 inline comments as done.
abrachet added inline comments.

================
Comment at: llvm/include/llvm/Object/MutableELFObject.h:193-194
     /// Return the number of elements in the table.
-    size_t size() const { return Mappings.size(); }
+    size_t size() const {
+      return llvm::count_if(Mappings, [](MappingType &Mapping) {
+        return Mapping.Type != MappingType::Removed;
----------------
rupprecht wrote:
> jhenderson wrote:
> > abrachet wrote:
> > > rupprecht wrote:
> > > > Many of these methods are now linear, meaning this library isn't feasible to operate on files that have a large number of symbols (especially if you end up calling these methods in a loop and go quadratic). I think this is fine to experiment with but should be noted, so it can be fixed before adding uses of it.
> > > FWIW, I think `size` is the only method which gets its complexity changed, this can be solved by just keeping a running count, and having `add` and `remove` methods change `Size`. Does that sound ok?
> > > 
> > > Methods like `getMutableSection` index into the table without any checks to deleted entries so those are all O(1), I'm trying to think of others which became O(N^2) but I cannot think of them.
> > Keeping a running count would make sense to me, if it improves the complexity of other functions.
> `getNextIndex()` is also linear (with the linear factor being # of removed symbols), making `getFirstIndex()` linear, which makes all the `*_begin()` iterator methods linear.
> 
> See rL354597 for a real world motivation with llvm-objcopy. An easy way to need to "remove" a lot of symbols is to extract a single section with `llvm-objcopy --only-section .foo` which will implicitly delete symbols not in that section. If the number of symbols not in that section is large, it will be slow if you keep having to iterate over all removed symbols. (See the test case here if you're interested: https://reviews.llvm.org/D58296?vs=on&id=187690)
> 
> That said, there isn't necessarily a need to prematurely optimize this. I think it's fine to just document that the methods are linear, and if we run into regressions when using it on real scenarios, we can rework the data structures.
> Keeping a running count would make sense to me, if it improves the complexity of other functions.
I think that `MutableTable::size()` isn't actually even used anywhere anymore I just updated it because it was used previously but its use was replaced by `getEndIndex()`. Should I just remove it then? Keeping a count of the number of elements would only improve the complexity of `size()` I think.

>getNextIndex() is also linear (with the linear factor being # of removed symbols), making getFirstIndex() linear, which makes all the *_begin() iterator methods linear.
Oh, right.

>See rL354597 for a real world motivation with llvm-objcopy. An easy way to need to "remove" a lot of symbols is to extract a single section with llvm-objcopy --only-section .foo which will implicitly delete symbols not in that section. If the number of symbols not in that section is large, it will be slow if you keep having to iterate over all removed symbols. (See the test case here if you're interested: https://reviews.llvm.org/D58296?vs=on&id=187690)
FWIW currently the deletion of a symbol I think is linear (the use of remove_if) and has the cost of copying the elements where this doesn't have that same cost. Certainly it will be slower if iteration happens many times after.

> That said, there isn't necessarily a need to prematurely optimize this. I think it's fine to just document that the methods are linear, and if we run into regressions when using it on real scenarios, we can rework the data structures.
Absolutely :)

================
Comment at: llvm/include/llvm/Object/MutableELFObject.h:134
+      assert(CurrentIndex < Mappings.size());
+      return *llvm::find_if(seq(CurrentIndex + 1, (uint64_t)Mappings.size()),
+                            [this](uint64_t Index) {
----------------
jhenderson wrote:
> Maybe the cast indicates that these indexes should be size_t everywhere, since they're indexes into a container? What do you think?
Sounds good. I have changed to `size_t` it does make more sense.

================
Comment at: llvm/unittests/Object/MutableELFObjectTest.cpp:28
+  if (MatchLen)
+    if ((size_t)std::distance(Begin, End) != List.size())
+      return false;
----------------
jhenderson wrote:
> `std::distance` already returns a size_t. No need to cast it.
I thought so too but it returns the iterators `difference_type` which on most containers is `ptrdiff_t`. That's a really strange decision by the committee I think, I don't see the rational of not using `size_t` given `distance` will never be negative.

================
Comment at: llvm/unittests/Object/MutableELFObjectTest.cpp:482-486
+  // 4 additional sections, index 0, .symtab, .strtab and .shstrtab are added
+  // by yaml2obj.
+  EXPECT_EQ(
+      std::distance(MutableObject.section_begin(), MutableObject.section_end()),
+      6);
----------------
labath wrote:
> abrachet wrote:
> > rupprecht wrote:
> > > I think writing a test matcher would make this more self-documenting than a comment.
> > > e.g.
> > > `EXPECT_THAT(MutableObject, SectionsAre({".remove_me", ..., }));`
> > > which would check both std::distance and each section name by iterating over all sections.
> > > 
> > > Writing a test helper (instead of a full blown matcher) would also work; you might not need a thorough matcher outside of this test.
> > I created this `MatchesRange` template function. It's not as cool as what you were suggesting but I couldn't figure out how to make a gtest matcher.
> You could consider something like this:
> ```
> std::vector<StringRef> Sections;
> transform(MutableObject.sections(), [](??? S) { return S->getName(); }, std::back_inserter(Sections));
> EXPECT_THAT(Sections, testing::ElementsAre(".foo", ".bar", ...));
> ```
> The reason I would prefer that over your current solution is because this will give you a mostly reasonable error message if the check fails, whereas here you'll just get "false is not true"(tm). (You could even make a utility function to hide the `transform` stuff and make the check a one-liner.)
Oh that's much better, before when I didn't know the order `yaml2obj` was putting the sections in I had to use `lldb` because all I got was "false is not true" like you said. Thanks!

> You could even make a utility function to hide the transform stuff and make the check a one-liner.
I'll probably do this later when I use it in more than one place.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D65633/new/

https://reviews.llvm.org/D65633