[PATCH] D96632: [THUMB2] add .w suffixes for ldr/str w/ immediates

Mon Feb 15 03:58:28 PST 2021

DavidSpickett added a comment.

TL;DR: I tried to track down what .w means to the assembler and what it means in the encoding descriptions. The two seem to differ slightly, but I think the assembler side is clear, and that's what we need here.

The explanation of .w in the armarm:

  A8.2 Standard assembler syntax fields

  Specifies optional assembler qualifiers on the instruction.

  The following qualifiers are defined:
  .N Meaning narrow, specifies that the assembler must select a 16-bit encoding for the
  instruction. If this is not possible, an assembler error is produced.

  .W Meaning wide, specifies that the assembler must select a 32-bit encoding for the
  instruction. If this is not possible, an assembler error is produced.

  If neither .W nor .N is specified, the assembler can select either 16-bit or 32-bit encodings. If both
  are available, it must select a 16-bit encoding. In a few cases, more than one encoding of the same
  length can be available for an instruction. The rules for selecting between such encodings are
  instruction-specific and are part of the instruction description.

That explains the assembler's point of view. What is less clear is what a .W means in an instruction description.

  A8.1.3 Instruction encodings

  For example, the assembler
  syntax documented for the 32-bit Thumb AND (register) encoding includes the .W qualifier to ensure that the
  32-bit encoding is selected even for the small proportion of operand combinations for which the 16-bit
  encoding is also available.

This is not a great example, because when I tested it with gcc it preferred 16 bit encodings in most situations that I could come up with. However, AND has conditions that depend on being in or out of an IT block, so those might take priority over any .w encoding.

For LDR there's no IT block stuff to consider, but gcc still prefers the smaller encoding if possible.

  .syntax unified

  // This fits into 16 bit T1. Should it prefer T3, whose description has .W?
  // gcc uses T1
  // (which seems to contradict the AND example, but hey)
  ldr r3, [r1, #124]
  // Will be T3 due to us adding .w
  ldr.w r3, [r1, #124]
  // This can only fit into T3 because of the 12 bit immediate
  ldr r3, [r1, #0x1ff]
  // T3 because it has the .W suffix in the description, could fit into T4 as well
  ldr.w r3, [r1, #255]
  // Uses T4 due to post increment
  ldr.w r3, [r1], #255

So my best guess is that I am misunderstanding the AND example and that .W marks the preferred 32 bit thumb encoding.

The assembler will:

- try to use a 16 bit encoding
- otherwise, try to use a 32 bit encoding
- if there are multiple 32 bit encodings that could apply, use the one with .W
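
To make that ordering concrete, here is a rough Python sketch of the selection as I understand it. It is a toy model, not what gcc or LLVM actually implement. The `Encoding` class, the `LDR_IMM` table and `select` are hypothetical names, the immediate ranges are simplified (low registers, offset addressing only), and T4 is modelled as accepting #255 because that is the premise of the example above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Encoding:
    name: str                     # label used in the instruction description
    bits: int                     # 16 or 32
    has_w: bool                   # True if the description's syntax carries .W
    fits: Callable[[int], bool]   # can this immediate be encoded?


# Hypothetical, simplified table for "ldr Rt, [Rn, #imm]".
LDR_IMM = [
    Encoding("T1", 16, False, lambda imm: 0 <= imm <= 124 and imm % 4 == 0),
    Encoding("T3", 32, True, lambda imm: 0 <= imm <= 4095),
    # Assumption taken from the example above: T4 also accepts #255.
    Encoding("T4", 32, False, lambda imm: -255 <= imm <= 255),
]


def select(imm: int, qualifier: Optional[str] = None) -> Optional[str]:
    """Return the chosen encoding name, or None for an assembler error."""
    candidates = [e for e in LDR_IMM if e.fits(imm)]
    if qualifier == ".n":
        candidates = [e for e in candidates if e.bits == 16]
    elif qualifier == ".w":
        candidates = [e for e in candidates if e.bits == 32]
    if not candidates:
        return None
    # Smallest encoding first; among equal sizes, prefer the one marked .W.
    return min(candidates, key=lambda e: (e.bits, not e.has_w)).name
```

With this model, `select(124)` gives T1, `select(124, ".w")` gives T3, and `select(255, ".w")` gives T3 rather than T4, which matches what I saw above.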

Note that there are some instructions like ADR where both T2 and T3 have .W. In this case (and I assume other cases) the two clearly wouldn't overlap, so it's still decidable which one should be used.
Also, some instructions have only one 32 bit encoding, which has .W. I expect this just makes the logic easier if you can say "every instruction that has 32 bit encodings has a preferred one".

This is all to support reassembly to the same bit patterns. If we assemble "ldr.w r3, [r1, #255]" and get T3, then disassemble, we get "ldr.w r3, [r1, #255]". When we reassemble that, we want to be sure we'll get T3, not T4.
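
The round trip can be sketched as a small property check. Again this is a toy Python model with hypothetical names (`assemble`, `disassemble`, `round_trips`) and simplified immediate ranges, not the real assembler or disassembler. It only shows why printing .w for T3 matters when the operands would also fit T1.

```python
def assemble(mnemonic: str, imm: int) -> str:
    """Toy selector for 'ldr' or 'ldr.w' with operands r3, [r1, #imm]."""
    wide = mnemonic == "ldr.w"
    if not wide and 0 <= imm <= 124 and imm % 4 == 0:
        return "T1"               # 16 bit encoding wins when allowed
    if 0 <= imm <= 4095:
        return "T3"               # preferred 32 bit encoding (has .W)
    if -255 <= imm < 0:
        return "T4"               # only choice for negative offsets
    raise ValueError("no encoding for this immediate")


def disassemble(encoding: str, print_w_for_t3: bool) -> str:
    """Return the mnemonic a hypothetical disassembler would print."""
    return "ldr.w" if encoding == "T3" and print_w_for_t3 else "ldr"


def round_trips(imm: int, print_w_for_t3: bool) -> bool:
    """Assemble 'ldr.w', disassemble, reassemble; is the encoding stable?"""
    first = assemble("ldr.w", imm)
    text = disassemble(first, print_w_for_t3)
    return assemble(text, imm) == first
```

In this model `round_trips(124, True)` holds, while `round_trips(124, False)` does not, because the plain "ldr" comes back as T1.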

Caveat: I can't cite anywhere in the armarm that says ".W means preferred 32 bit". That's just the conclusion I'm drawing from what I've found.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96632/new/

https://reviews.llvm.org/D96632