[PATCH] D55881: [llvm-objcopy] [COFF] Add support for removing symbols

Wed Jan 2 13:10:18 PST 2019

mstorsjo marked 9 inline comments as done.
mstorsjo added inline comments.

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-local-symbols.s:20-21
+
+# SYMBOLS-POST: 00000000 T mainfunc
+# SYMBOLS-POST: 00000000 d ptr
+# SYMBOLS-POST-EMPTY:
----------------
jhenderson wrote:
> This will produce a false positive, if otherfunc is present, because you've got nothing checking the total number of symbols. I don't know if COFF has a convenient way of testing this, but I would recommend doing so. Otherwise, you should put a -NEXT suffix on this line (and similarly in the above block) - you should also show that there are no symbols before this list if doing things this way, since llvm-objcopy might have reordered the symbols to put otherfunc first instead of removing it.
> 
> Could we also test that two different local symbols are removed? That shows that we remove all locals not just the first.
Indeed, that's true. I somehow thought I tested this stricter than this, but apparently I misremembered.

Yes, some sort of stricter check would indeed be good. I think the current order actually is produced by llvm-nm and not dependent on the source file itself, but avoiding such assumptions is of course better.

Using `-NEXT` would indeed help, but is there anything that would force a match on the first line? `llvm-nm` doesn't print anything at all before the symbols. `llvm-readobj -symbols` is way too verbose for a neat test here (it prints 5+ lines per symbol), but `llvm-objdump -t` turns out to be very compact and detailed, so that's probably even better than this. That output also has got a nice start header that makes the use of `-NEXT` exact.

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-local-symbols.s:1
+# RUN: llvm-mc -triple=x86_64-windows-gnu %s -filetype=obj -o %t.o
+
----------------
jhenderson wrote:
> mstorsjo wrote:
> > alexshap wrote:
> > > so I'm  confused - what decision has been made regarding tests (sorry, many people have commented, I'd like to understand the logic): 
> > > all the new tests will use assembly  while yaml is deprecated, we will use both yaml and assembly, we will use yaml and assembly only in exceptional cases, or what ?
> > > As far as I'm concerned - i want consistency + functionality, at least in the future. 
> > I'm not aware of any other decision other than preferring yaml over binary object files.
> > 
> > I'd strongly prefer to be able to use both yaml and assembly for these tests; yaml is ideal for dumped output from a full toolchain, including debug info and everything that a real object file would contain. Assembly is better for the cases when I want to construct a specific setup to test certain cases, where this keeps the test much smaller and easier to read and understand. And for executables instead of object files, yaml is the only option as we can't rely on lld.
> Yes, to my knowledge, we haven't agreed anything formal re. YAML versus assembly, but I personally would prefer YAML wherever possible. From experience, at least for ELF, assembly often doesn't provide enough low-level control which we need to produce objects that have the properties we want in the input (e.g. what kinds of relocations get emitted etc). The comment about calling `otherfunc` not resulting in a relocation is a perfect example of this. It also removes the need for the "PRE" checks.
> 
> Also, I might be wrong, but wouldn't making this use YAML allow us to drop the X86-requirement here?
Yes, using YAML would remove the dependency on X86 being enabled, and I agree that it makes some things less ambiguous. As for writing (and reading) the tests I prefer assembly, but if there's strong opposition I can make them YAML instead and just use assembly as source locally when crafting the tests.

There is, however, at least one case in a test (that depends on D56140) which is impossible to make using YAML as input at the moment; the COFF YAML format doesn't distinguish between multiple symbols with the same name for relocations. To fix that, the COFF YAML format would need to be changed to use symbol indices instead of names for relocations (mirroring what actually is stored in the files).

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-symbol2.s:6-10
+# SYMBOLS-PRE: 00000000 T mainfunc
+# SYMBOLS-PRE: 0000000c T otherfunc
+# SYMBOLS-PRE: 0000000d T otherfunc2
+# SYMBOLS-PRE: 00000000 d ptr
+# SYMBOLS-PRE-EMPTY:
----------------
jhenderson wrote:
> I don't think this check really gives us anything in this case.
Indeed, this part of the test is pretty pointless.

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-symbol2.s:12
+
+# ERROR: Can't remove referenced symbol ptr
+
----------------
jhenderson wrote:
> It would be nice if this test checked that the file name appears in the error. The symbol name should probably also be quoted in some fashion. At least on ELF, it is possible to have symbol names with spaces in, so the error can be a bit ambiguous!
Ok, will make the code quote it and have the test look for a filename as well.

================
Comment at: tools/llvm-objcopy/COFF/COFFObjcopy.cpp:38
+  Obj.removeSymbols([&](const Symbol &Sym) {
+    if (is_contained(Config.SymbolsToKeep, Sym.Name))
+      return false;
----------------
jhenderson wrote:
> There seems to be an awful lot of code in this function that I don't think is tested. Perhaps it would make more sense to implement it in more patches (e.g. basic explicit symbol removal, then keep, then strip all or whatever etc), and test each part individually.
Fair enough, I can do that.

================
Comment at: tools/llvm-objcopy/COFF/COFFObjcopy.cpp:52
+                                                Sym.Name,
+                                            object_error::parse_failed));
+      return true;
----------------
jhenderson wrote:
> This isn't a parsing failure. Use a more appropriate error value, like `invalid_argument`.
Ok, will use a different error code.

================
Comment at: tools/llvm-objcopy/COFF/Object.cpp:38-39
+Error Object::markSymbols() {
+  for (Symbol &Sym : Symbols)
+    Sym.Referenced = false;
+  for (const Section &Sec : Sections) {
----------------
jhenderson wrote:
> Referenced is initialised to false. Do you need to reset it here as well? I don't envisage a situation where we should need to do this more than once on a single object.
I'd rather have the method be self-contained enough to clear it here, in case the caller would call this method multiple times. If it doesn't, the method leaks the limitation that it can only be called once, which in itself also is rather encapsulation.

To make things safer perhaps the field should be initialized to true, to indicate that all symbols are referenced unless we've called this and actually know for sure?

================
Comment at: tools/llvm-objcopy/COFF/Object.h:46-47
   ArrayRef<uint8_t> AuxData;
+  size_t Key;
+  size_t RawIdx;
+  bool Referenced = false;
----------------
jhenderson wrote:
> Don't abbreviate unnecessarily (Idx -> Index). Do we really need both separately? If so, perhaps better names would be `OriginalIndex` and `Index` or similar, and then use the input symbol index as the key.
Ok, I'll expand the unnecessarily terse name.

As for `Key` vs `OriginalIndex` - yes, input symbol index would work just as well for key (and that's pretty much what I use right now anyway). But as @jakehehrlich wants to have the Object API usable even for constructing object files from scratch, naming it `OriginalIndex` feels a bit limiting.

================
Comment at: tools/llvm-objcopy/COFF/Object.h:70
+  // Update SymbolMap and RawIdx in each Symbol. This must be called
+  // after updating the Symbols vector from outside of this class.
+  void updateSymbols();
----------------
jhenderson wrote:
> This comment seems to indicate an encapsulation issue to me. If this function has to be called any time the vector is updated, perhaps the vector should be private and accessed only through member functions?
Yes, that'd be a nicer abstraction. That leaves things a bit open wrt how to use it from the Reader though; I could have an add method which adds one symbol at a time and updates the map after each one. That ends up with O(n^2) for the map updates. Alternatively the Reader produces a vector locally and there'd be an `addSymbols(vector<Symbol>)` method here, which adds all at once and does one update of the map at the end.

What do you think?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55881/new/

https://reviews.llvm.org/D55881