[PATCH] D131589: [llvm-objdump] Handle multiple syms at same addr in disassembly.

Simon Tatham via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 11 06:19:47 PDT 2022


simon_tatham marked 11 inline comments as done.
simon_tatham added a comment.

In D131589#3714216 <https://reviews.llvm.org/D131589#3714216>, @rafauler wrote:

> I'm not the code owner of these disassembly tools

Several of the reviewers I picked for this patch were authors of the specific parts of the code that I was doing something complicated to. According to git, you were involved in setting up the `--disassemble-symbols` system in the first place, so I particularly value your input on whether there's any intentional feature of its semantics that I've broken without noticing :-)

(The `onSymbolStart` mechanism is the other one that I'm concerned about, because neither of its use cases is familiar to me. So I tried to pick reviewers who know something about that, as well.)



================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1494-1497
+      {
+        while (SI != SE && Symbols[SI].Addr == Start)
+          SI++;
+        SymbolsHere = ArrayRef<SymbolInfoTy>(&Symbols[FirstSI], SI - FirstSI);
----------------
jhenderson wrote:
> rafauler wrote:
> > Curious about the style here where we define a scope to isolate a piece of computation. I don't think I've seen this in other LLVM files or in the coding style, so I'm curious what other people think about it.
> > 
> > I do think it makes the code more clear to understand here, because it's only 3 lines. But on the other hand, it increases nesting on line 1509, which can actually make code harder to read (arguably).
> I'm not sure this style is used in LLVM, or at least not in the areas I'm familiar with, so I'd drop it.
Oops, good catch. Those braces were originally there to isolate some //variables// so that they weren't in scope for the rest of the enormous for-loop body. But by the time I finished developing the patch I'd removed all the variables local to the block, and somehow didn't notice that even in three last-minute pre-upload reviews of my own :-)

Now I look closely, the same thing happened to the other scope that decides which symbols to print. I'll fix that one too.


================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1501
+      // Get the demangled names of all those symbols.
+      std::vector<std::string> SymNamesHere;
+      for (const SymbolInfoTy &Symbol : SymbolsHere) {
----------------
jhenderson wrote:
> Can this be a vector of `StringRef` to save the copying?
Not trivially, because in the case where we have to demangle names, `demangle()` returns a newly made string and we have to have somewhere to store it.

I've changed it so that we have the vector of `StringRef` you wanted, and //also// a vector of `std::string` which is left empty when not in demangling mode, and the StringRefs point at the local vector or the original name elsewhere, as appropriate.


================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1511
+        // If the user has given the --disassemble-symbols option, then we must
+        // display every symbol in that set (that we can find at all), and no
+        // others.
----------------
jhenderson wrote:
> "(that we can find at all)" - I'm not sure what this is saying. Do you even need it?
I just meant that if a symbol specified by the user doesn't appear in the object file at all, then we're exempt from the need to display it. But I agree that's not 100% clear, or particularly important to highlight in this context. I'll remove the parenthesis.


================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1542
+        if (ShowAllSymbols) {
+          for (unsigned i = 0; i < SymbolsHere.size(); ++i)
+            SymsToPrint[i] = true;
----------------
jhenderson wrote:
> `size_t` would be the more natural type for loops over `SymbolsHere`, since that's the value returned by `size()` and used by the index operator.
> 
> Ditto below.
I agree! But I guessed that the widespread existing use of `unsigned` in similar cases in the LLVM code base (for example, this very loop where `SI` ranges up to `Symbols.size()`) was a local idiom that I'd be criticised for going against. I'm glad to see that the opposite is true ;-)


================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1556-1557
       uint64_t End = std::min<uint64_t>(SectionAddr + SectSize, StopAddress);
-      if (SI + 1 < SE)
-        End = std::min(End, Symbols[SI + 1].Addr);
+      if (SI < SE)
+        End = std::min(End, Symbols[SI].Addr);
       if (Start >= End || End <= StartAddress)
----------------
jhenderson wrote:
> It's not immediately obvious to me why this line has changed. Could you explain please, as it likely means I've missed something.
In the existing version of the loop, `SI` is incremented //after// the loop body runs, by the `++SI` in the `for` statement itself. So throughout the loop body, `SI` points at the symbol we're currently disassembling, and `SI+1` here indicates the //next// symbol, whose address marks the point where we're planning to finish this iteration of the disassembly loop.

In the new version, I've removed the `++SI` in the `for` statement, and replaced it with code at the beginning of the loop body that advances `SI` past all the symbols defined at the same address. So after that code runs, the rest of the loop body sees `SI` //already// pointing at the first symbol defined at a later address.


================
Comment at: llvm/tools/llvm-objdump/llvm-objdump.cpp:1655
       if (Obj.isELF() && !DisassembleAll && Section.isText()) {
-        uint8_t SymTy = Symbols[SI].Type;
-        if (SymTy == ELF::STT_OBJECT || SymTy == ELF::STT_COMMON) {
-          dumpELFData(SectionAddr, Index, End, Bytes);
-          Index = End;
+        for (const SymbolInfoTy &Symbol : SymbolsHere) {
+          uint8_t SymTy = Symbol.Type;
----------------
jhenderson wrote:
> Is there a subtle behaviour change here if you have multiple symbols at the same address but different types (i.e. one is STT_OBJECT and one isn't, e.g. STT_FUNC)?
Potentially, yes. Previously, `llvm-objdump` would pick just one of the symbols defined at the address, and base its decision on that symbol alone. With this patch, it will go through all of them, and spots any STT_OBJECT even if it's not the symbol last in the sorted list.

This is just the sort of thing I hoped to have a useful discussion about in order to decide what the behaviour //should// be, to avoid the risk of writing oodles of code to implement a complicated policy that we had no consensus on :-) so thanks for flagging it up.

What do you think we //should// do if an STT_FUNC and an STT_OBJECT occur at the same address? llvm-objdump's existing policy  doesn't look particularly deliberate to me – it's an artefact of the code's previous lack of attention to collocated symbols. Perhaps it's nonetheless best to stick with the existing policy just for stability's sake, but if so, I'd prefer that we'd discussed other options before deciding that.

Other possibilities that spring to mind are to deliberately make STT_OBJECT highest priority (which is what's happening in this version of the code), or to make it lowest priority, or to choose based on some criterion like symbol index in the ELF file (go with whichever symbol was first/last in the actual object file's symtab). And maybe, whichever of those we do, emit a warning that flags up that we had to make an arbitrary decision that could have gone the other way.

What do you think?

(PS I hope you're not going to like the symtab index idea, because that information isn't preserved at all in `SymbolInfoTy` so it would take a load more plumbing :-)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131589/new/

https://reviews.llvm.org/D131589



More information about the llvm-commits mailing list