[llvm] [MC][ARM] Fix crash when assembling Thumb 'movs r0, #foo'. (PR #115026)

Fri Nov 8 01:27:51 PST 2024

================
@@ -8653,6 +8657,41 @@ bool ARMAsmParser::validateInstruction(MCInst &Inst,
   return false;
 }
 
+// After processInstruction has transformed an instruction being assembled into
+// a different opcode, do any further validity checks that the new opcode
+// depends on.
+//
+// `Inst` contains the final modified form of the instruction, but `Operands`
+// contains the parsed operands from the _original_ instruction, because
+// nothing has updated them (`processInstruction` received them as const).
+// `OrigOpcode` contains the original value of `Inst.getOpcode()`, which should
+// give enough context to know how to understand the original operands list.
+bool ARMAsmParser::revalidateInstruction(MCInst &Inst,
+                                         const OperandVector &Operands,
+                                         unsigned MnemonicOpsEndInd,
+                                         unsigned OrigOpcode) {
+  const unsigned NewOpcode = Inst.getOpcode();
+
+  if (OrigOpcode == ARM::t2ADDri && NewOpcode == ARM::tADDi8) {
+    // t2ADDri is the Thumb 32-bit immediate add instruction, for example
+    // 'add[s] r0,r1,#0xff00'. If its immediate argument isn't a constant
+    // requiring shifting, then processInstruction can turn it into tADDi8, the
+    // simpler 16-bit Thumb immediate add (provided all the other conditions
+    // for that transformation are met). That makes it too late for
+    // validateInstruction to do this check, which it would have done if it had
+    // known from the start that the instruction was tADDi8.
----------------
statham-arm wrote:

It's not _quite_ as weird, because the original instruction was only Thumb2 because of LLVM's decision – the user asked for something much more general-purpose.  It doesn't seem totally unreasonable to me that "adds r0, #extern_symbol" should give the same error whether or not Thumb2 is available. It _looks_ like the same instruction, after all!

I did wonder about trying to do a more complete fix. The real problem case would be something like this (targeting v7-M or v8-M Mainline):
```
adds r0, r0, #foo
adds r1, r1, #bar
.equ foo, :lower0_7:some_extern_symbol
.equ bar, 0xff00
```
If this were ever to assemble successfully, it would have to be done by making the adds to r0 `tADDi8` which can accept the necessary relocation, but the one to r1 `t2ADDri` which can accept the shifted immediate. But the assembler must commit to each instruction's opcode before it sees the necessary data further down the file.

So I don't think a really complete fix is possible without changing the entire structure of how the assembler works. But I can look into the alternative approach of trying to make the `t2ADDri` → `tADDi8` transformation more discriminating, in the hope that we can at least fix cases that _don't_ forward-refer to an `equ`.

https://github.com/llvm/llvm-project/pull/115026