[PATCH] D84191: [llvm-objdump] Symbolize binary addresses for low-noisy asm diff.

Thu Jul 23 00:53:47 PDT 2020

jhenderson added a comment.

It looks like you might still have a binary file in the patch, that is no longer needed?

================
Comment at: llvm/docs/CommandGuide/llvm-objdump.rst:209-212
+    A non-symbolized branch instruction with a local target like `jge 0x20117e <_start+0x25>` will 
+    be printed as `jge	<L0>`.
+    
+    A non-symbolized pc-relative memory access like `dword ptr [rip + 4112]` will be printed as `<g>`.
----------------
I would actively include multi-line output snippets as your examples rather than using prose. Something like:
"
```
  jge 0x20117e <_start+0x25>
  dword ptr [rip + 4112]
```
might become
```
<g>:
  jge  <L0>
  dword ptr [rip + 4112]
<L0>:
```
"
(Please edit the above to reflect a theoretically real output. Also, please use the rst support for code examples. I think there will be other examples, e.g. in the llvm-symbolizer documentation.)

================
Comment at: llvm/docs/CommandGuide/llvm-objdump.rst:206
+
+  Only works with --x86-asm-syntax=Intel and an executable binary.
+
----------------
hoyFB wrote:
> hoyFB wrote:
> > jhenderson wrote:
> > > MaskRay wrote:
> > > > You probably want to mean a linked image (not a relocatable object file).
> > > > 
> > > > Capitalized `Intel` isn't supported. att is much more common and --x86-asm-syntax=intel is rare. As a related fact, LLVM's Intel syntax assembler gets less scrutiny and tends to have some problems.
> > > Is there anything fundamentally preventing this from working with the default assembly syntax or relocatable objects? Or is it just that you haven't implemented it yet?
> > Yes, a linked image is more accurate.
> > 
> > Changed Intel to lower case. 
> > 
> > The att syntax is now supported too. 
> We need a tool to compare executable image since very often the immediate object files are not available. 
> 
> Regarding the assembly syntax, we are mostly comfortable reading the intel syntax. But since the att syntax is more common, I'll add support for that.
I'm not sure you didn't misunderstand my comment earlier. Let's say a user does have a single intermediate object file they want to disassemble, and it has plenty of references within itself. These addresses will be present, thus comparing it against a different version of the same object might produce a noisy diff. It sounds like this option would be helpful there, so why doesn't it work?

================
Comment at: llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml:2-3
+# RUN: yaml2obj %s --docnum=1 -o %t
+# RUN: llvm-objdump %t -d --symbolize-operands --x86-asm-syntax=intel --no-show-raw-insn --no-leading-addr | FileCheck %s --match-full-lines --check-prefix=INTEL
+# RUN: llvm-objdump %t -d --symbolize-operands --x86-asm-syntax=att --no-show-raw-insn --no-leading-addr | FileCheck %s --match-full-lines --check-prefix=ATT
+
----------------
These two lines are quite long. I'd recommend you split them over multiple lines like so:

```
# RUN: llvm-objdump %t -d --symbolize-operands --x86-asm-syntax=att --no-show-raw-insn --no-leading-addr | \
# RUN:   FileCheck %s --match-full-lines --check-prefix=ATT
```

================
Comment at: llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml:9
+# ATT-NEXT: <L1>:
+# ATT-NEXT: cmpl	, %eax <g>
+# ATT-NEXT: jge <L0>
----------------
What's with the spaces before the ',' here.

================
Comment at: llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml:15-22
+# INTEL: <_start>:
+# INTEL-NEXT: push rax
+# INTEL-NEXT: <L1>:
+# INTEL-NEXT: cmp	eax, dword ptr <g>
+# INTEL-NEXT: jge <L0>
+# INTEL-NEXT: jmp <L1>
+# INTEL-NEXT: <L0>:
----------------
Here, and in the ATT example above, I'd add leading spaces, so that everything is lined up like it would be in the actual output. In other words, I'd expect something a bit like:
```
# INTEL:      <_start>:
# INTEL-NEXT:   push rax
# INTEL-NEXT: <L1>:
# INTEL-NEXT:   cmp  eax, dword ptr <g>
# INTEL-NEXT:   jge <L0>
# INTEL-NEXT:   jmp <L1>
# INTEL-NEXT: <L0>:
# INTEL-NEXT:   ret
```

================
Comment at: llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml:44
+    Value:   0x4000  
+  - Name:    g
+    Section: .data
----------------
`g` isn't obviously an arbitrary symbol name to me. I'd suggest using something more obvious, e.g. simply `symbol`.

================
Comment at: llvm/test/tools/llvm-objdump/X86/elf-disassemble-symbololize-operands.yaml:9
+# ATT-NEXT: <L1>:
+# ATT-NEXT: cmpl	, %eax <g>
+# ATT-NEXT: jge <L0>
----------------
hoyFB wrote:
> This is less than ideal because of the current less intrusive implementation. It's probably fine from asm diffing standpoint. To fix this we may want to pass the symbol/label resolution result from the common code into the per-target inst printer and print the symbols right in place for a memory operand. But that may cause changes to the core `MCInstPrinter` interface and the callbacks to initialize target inst printers.
It's not clear to me what is less than ideal here? What would you prefer in an ideal situation?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84191/new/

https://reviews.llvm.org/D84191