[PATCH] D130357: [MC,llvm-objdump,ARM] Target-dependent disassembly resync policy.

Mon Jul 25 06:59:24 PDT 2022

simon_tatham added inline comments.

================
Comment at: llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h:181
+  ///
+  /// \param Address  - The address, in the memory space of region, of the first
+  ///                   byte of the symbol.
----------------
DavidSpickett wrote:
> "memory space of the region" perhaps?
Oh yes, I copy-pasted the parameter comments from one of the existing API functions just above it without noticing that they didn't all say quite the same things or make sense in all cases.

================
Comment at: llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h:182
+  /// \param Address  - The address, in the memory space of region, of the first
+  ///                   byte of the symbol.
+  /// \param Bytes    - A reference to the actual bytes at the symbol location.
----------------
DavidSpickett wrote:
> What is "the symbol" here? Maybe from context it will be a symbol but it seems to me like you'd be in the middle of some function maybe. So symbol is "main" but I'm N bytes away from where it starts by this point.
> 
> Do I have the wrong end of the stick?
Another copy-paste goof.

================
Comment at: llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h:183
+  ///                   byte of the symbol.
+  /// \param Bytes    - A reference to the actual bytes at the symbol location.
+  /// \return         - A number of bytes to skip. Must always be greater than
----------------
DavidSpickett wrote:
> I'd explain why the parameter exists. You'll probably end up repeating the thumb justification but hey why not.
> 
> Also is the size of this known to be in some range or does the target also decide that. E.g. if thumb needed 4 bytes of context to decide, the Arm backend would pass this 4 bytes. Some other target could choose less or more or none.
Where I've called it, I just passed all the data available. If that's not enough to make the best choice, then the caller can't magic more data out of the air anyway, so the callee will just have to do the best it can (which in Thumb I decided was advancing to the next multiple of 2 bytes – still better than the previous 1!)

================
Comment at: llvm/include/llvm/MC/MCDisassembler/MCDisassembler.h:186
+  ///                   zero. May be greater than the size of Bytes.
+  virtual uint64_t suggestBytesToSkip(ArrayRef<uint8_t> Bytes,
+                                      uint64_t Address) const;
----------------
DavidSpickett wrote:
> `const ArrayRef<uint8_t>`?
No, because firstly, that would only make the tiny `ArrayRef` structure itself const and not the pointed-to data, and secondly, `ArrayRef` is implicitly a pointer-to-const anyway.

(This too is copied from the previous API functions, so if it had been a bug, they'd have this bug as well.)

================
Comment at: llvm/lib/Target/AArch64/Disassembler/AArch64Disassembler.cpp:353

+uint64_t AArch64Disassembler::suggestBytesToSkip(ArrayRef<uint8_t> Bytes,
+                                                 uint64_t Address) const {
----------------
DavidSpickett wrote:
> We're assuming that Address is already aligned, I assume that's safe given that we're either:
> * disassembling from the start of a fn, which would be aligned
> * have just disassembled an instruction, which would have been aligned, and 4 bytes in size.
> 
> Correct?
> 
> I guess someone could give MC a misaligned function start but garbage in garbage out in that case? Or would you want to align up to the nearest 4 bytes.
Mmmm. I did wonder about specifying `suggestBytesToSkip` so that you could give it a totally arbitrary alignment and it would do something sensible. But I didn't like the idea that every single implementation (other than the default 'skip 1 byte because anything's a valid start position') would end up having to contain very similar boilerplate. That struck me as a sign of having put the API boundary in the wrong place.

In this patch there are only two implementations of `suggestBytesToSkip` (or two and a half if you count the Arm/Thumb branches of the AArch32 one). But I expect that most non-x86 targets will end up wanting to do something sensible in here – surely a lot of targets are fixed-alignment RISC style, and even m68k has an alignment constraint if I remember my Amiga-owning days correctly. So there'd end up being a lot of copies of that code!

================
Comment at: llvm/lib/Target/ARM/Disassembler/ARMDisassembler.cpp:758
+      if (Bytes.size() >= 2) {
+        uint16_t Insn16 = (Bytes[1] << 8) | Bytes[0];
+        return Insn16 < 0xE800 ? 2 : 4;
----------------
DavidSpickett wrote:
> Is the in memory layout of thumb instructions the same between endians?
In the modern (BE8-style) Arm architecture, yes: instructions are stored little-endian regardless of the data endianness. In A32 that means little-endian 32-bit words, and in T32 it means little-endian 16-bit halfwords.

In older versions of the architecture that was different, if I remember rightly. But the rest of LLVM doesn't support those versions either, because this code is taken directly from the code in the same file that extracts the halfword for actual disassembly.

================
Comment at: llvm/test/tools/llvm-objdump/ELF/ARM/unknown-instr-resync.test:1
+# RUN: yaml2obj %s | llvm-objdump -d --mcpu=cortex-a8 - | FileCheck %s
+
----------------
DavidSpickett wrote:
> I think this one is the thumb side of the testing, can you add that in a comment in the file if so. Judging by the skip 4 then skip 2 later.
> 
> (if this is the thumb test do you need another one for Arm only?)
It's both – the first three instructions are Arm, including an unrecognised Arm instruction, and the next six are Thumb, including a 16- and a 32-bit unrecognised Thumb instruction.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130357/new/

https://reviews.llvm.org/D130357