[PATCH] D55881: [llvm-objcopy] [COFF] Add support for removing symbols

Thu Jan 3 02:49:04 PST 2019

jhenderson added a comment.

> llvm-readobj -symbols is way too verbose for a neat test here (it prints 5+ lines per symbol)

Not that it's really relevant here, I think, but llvm-readelf (aka GNU output style for llvm-readobj) produces single line output for each symbol. I've not looked, but I'm guessing that llvm-readobj doesn't have an equivalent for COFF. Maybe that's worth taking a look at at some point? We actually tend to use the verbose output style for the llvm-objcopy ELF tests though.

> Yes, using YAML would remove the dependency on X86 being enabled, and I agree that it makes some things less ambiguous. As for writing (and reading) the tests I prefer assembly, but if there's strong opposition I can make them YAML instead and just use assembly as source locally when crafting the tests.

I don't feel strongly about it, but I would prefer YAML, because of the extra control granted.

> There is, however, at least one case in a test (that depends on D56140 <https://reviews.llvm.org/D56140>) which is impossible to make using YAML as input at the moment; the COFF YAML format doesn't distinguish between multiple symbols with the same name for relocations. To fix that, the COFF YAML format would need to be changed to use symbol indices instead of names for relocations (mirroring what actually is stored in the files).

Hmm... that's an interesting case. I think it accepting arbitrary indexes would be nice. IIRC, the ELF YAML format allows for some fields specifying either a name or index, and if an index, it doesn't do any checking, which allows for crafting invalid inputs to test error handling. Having a symmetry here would make a lot of sense.

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s:11-12
+# RELOCS-NEXT:   Section (1) .text {
+# RELOCS-PRE-NEXT:  0x4 IMAGE_REL_AMD64_REL32 ptr (16)
+# RELOCS-POST-NEXT: 0x4 IMAGE_REL_AMD64_REL32 ptr (15)
+# RELOCS-NEXT:     0xE IMAGE_REL_AMD64_REL32 foo (12)
----------------
Unless there's a strong need, I'd just drop the (16)/(15) bit at the end, as it allows folding these two lines together, and doesn't really give us anything, since there's only one "ptr" symbol.

================
Comment at: test/tools/llvm-objcopy/COFF/X86/remove-symbol1.s:47
+
+# SYMBOLS-POST: SYMBOL TABLE:
+# SYMBOLS-POST-NEXT: .text
----------------
Could you largely fold this and the PRE block together, to reduce risk of copy/paste issues, and make clear the difference?

You could do this by sharing the majority of the checks, but having a single difference for the removed symbol:
```
...
# SYMBOLS-NEXT: foo
# SYMBOLS-PRE-NEXT: otherfunc
# SYMBOLS-NEXT: otherfunc2
...
```

================
Comment at: tools/llvm-objcopy/COFF/COFFObjcopy.cpp:32
+  // If we need to do per-symbol removals, initialize the Referenced field.
+  if (Config.StripUnneeded || Config.DiscardAll ||
+      !Config.SymbolsToRemove.empty())
----------------
You might want to simplify this `if` in this first version to only check the SymbolsToRemove.

================
Comment at: tools/llvm-objcopy/COFF/COFFObjcopy.cpp:40-42
+      // Explicitly removing a referenced symbol is an error, while
+      // implicitly removing them as part of a blanket removal above is
+      // accepted.
----------------
This comment doesn't really make sense in the patch's current form.

================
Comment at: tools/llvm-objcopy/COFF/Object.cpp:38-39
+Error Object::markSymbols() {
+  for (Symbol &Sym : Symbols)
+    Sym.Referenced = false;
+  for (const Section &Sec : Sections) {
----------------
mstorsjo wrote:
> jhenderson wrote:
> > Referenced is initialised to false. Do you need to reset it here as well? I don't envisage a situation where we should need to do this more than once on a single object.
> I'd rather have the method be self-contained enough to clear it here, in case the caller would call this method multiple times. If it doesn't, the method leaks the limitation that it can only be called once, which in itself also is rather encapsulation.
> 
> To make things safer perhaps the field should be initialized to true, to indicate that all symbols are referenced unless we've called this and actually know for sure?
Okay, I see your point, so it's fine what you have done. I have issues both with it being set to true and to false, because each would result in technically incorrect state for a period of time until the function is called. Perhaps it would be best to simply not initialise Referenced at all, until `markSymbols` is called, since the value can't be relied on to be safe without that call.

================
Comment at: tools/llvm-objcopy/COFF/Object.h:46-47
   ArrayRef<uint8_t> AuxData;
+  size_t Key;
+  size_t RawIdx;
+  bool Referenced = false;
----------------
mstorsjo wrote:
> jhenderson wrote:
> > Don't abbreviate unnecessarily (Idx -> Index). Do we really need both separately? If so, perhaps better names would be `OriginalIndex` and `Index` or similar, and then use the input symbol index as the key.
> Ok, I'll expand the unnecessarily terse name.
> 
> As for `Key` vs `OriginalIndex` - yes, input symbol index would work just as well for key (and that's pretty much what I use right now anyway). But as @jakehehrlich wants to have the Object API usable even for constructing object files from scratch, naming it `OriginalIndex` feels a bit limiting.
Okay, no problem. I'd marginally prefer `UniqueID` to `Key`, I think, but I'm okay if you prefer it this way around.

================
Comment at: tools/llvm-objcopy/COFF/Object.h:70
+  // Update SymbolMap and RawIdx in each Symbol. This must be called
+  // after updating the Symbols vector from outside of this class.
+  void updateSymbols();
----------------
mstorsjo wrote:
> jhenderson wrote:
> > This comment seems to indicate an encapsulation issue to me. If this function has to be called any time the vector is updated, perhaps the vector should be private and accessed only through member functions?
> Yes, that'd be a nicer abstraction. That leaves things a bit open wrt how to use it from the Reader though; I could have an add method which adds one symbol at a time and updates the map after each one. That ends up with O(n^2) for the map updates. Alternatively the Reader produces a vector locally and there'd be an `addSymbols(vector<Symbol>)` method here, which adds all at once and does one update of the map at the end.
> 
> What do you think?
I think an `addSymbols` function as you describe would make a lot of sense. A future change might also want a single `addSymbol` function too, but that's just a minor thing that can be added when required.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55881/new/

https://reviews.llvm.org/D55881