[llvm-dev] [RFC] Checking inline assembly for validity

Mon Nov 26 02:54:38 PST 2018

GCC-style inline assembly is notoriously hard to write correctly, because it is
the user's responsibility to tell the compiler about the requirements of the
assembly (inputs, outputs, modified registers, memory accesses), and getting
this wrong results in silently generating incorrect code. Whether a mistake
actually causes a failure also depends on register allocation and scheduling
decisions made by the compiler, so an inline assembly statement may appear to
work correctly, then silently break when another change to the code or a
compiler upgrade causes those decisions to change.

I've posted a prototype patch at https://reviews.llvm.org/D54891 which tries to
improve this situation by emitting diagnostics when the instructions inside the
inline assembly string do not match the operands to the inline assembly
statement. We can do this because we parse the assembly in the same process as
the compiler, and the MC layer has some knowledge of which registers are read
and written by an assembly instruction.

For example, this C code, which tries to add 3 integers together, looks OK at
first glance:

  int add3(int a, int b, int c) {
    int x;
    asm volatile(
        "add %0, %1, %2; add %0, %0, %3"
      : "=r" (x)
      : "r" (a), "r" (b), "r" (c));
    return x;
  }

However, the compiler is allowed to allocate x (the output operand) and c (an
input operand) to the same register, as it assumes that c will be read before x
is written. This code might happen to work correctly when first written, but a
later change (either in the source code or compiler) could cause register
allocation to change, and this will start silently generating the "wrong" code.

With this patch, the compiler emits this warning for the above code:

  test.cpp:4:7: warning: read from an inline assembly operand after first
                         output operand written, suggest adding early-clobber
                         modifier to output operand [-Winline-asm]
        "add %0, %1, %2; add %0, %0, %3"
        ^
  <inline asm>:1:30: note: instantiated into assembly here
          add r0, r0, r1; add r0, r0, r2
                                      ^
  test.cpp:4:7: note: output operand written here
        "add %0, %1, %2; add %0, %0, %3"
        ^
  <inline asm>:1:6: note: instantiated into assembly here
          add r0, r0, r1; add r0, r0, r2
              ^

The warning is a bit noisy because it prints both the original source code and
final assembly code, and the carets for the original source just point to the
start of the line, but that's a separate issue.
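
For reference, here is a sketch of what following the warning's suggestion
might look like. It assumes a 32-bit ARM target for the assembly itself; the
#else branch is only a portable stand-in I've added so the snippet builds
elsewhere, and is not part of the patch.

```c
/* Sketch: add3 with the early-clobber modifier ("&") on the output operand. */
int add3(int a, int b, int c) {
  int x;
#if defined(__arm__)
  /* "=&r" tells the compiler that x may be written before all of the inputs
     have been read, so x must not share a register with a, b or c. */
  asm volatile(
      "add %0, %1, %2; add %0, %0, %3"
    : "=&r" (x)
    : "r" (a), "r" (b), "r" (c));
#else
  /* Assumption: plain C fallback so the sketch compiles on non-ARM hosts. */
  x = a + b + c;
#endif
  return x;
}
```

With the early-clobber constraint, the second add can no longer have its %3
input overwritten by the first add's write to %0.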

I've designed this to be independent of register allocation decisions, so the
above diagnostic is always reported, regardless of whether x and c were
allocated in the same register or not. This allows it to catch code which
currently works but might break in future.
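
To illustrate why this can be independent of register allocation, here is a
minimal sketch of the kind of check involved, working purely on inline
assembly operand indexes rather than physical registers. The struct and
function names are hypothetical and do not come from the patch; the real
implementation works on MCInsts.

```c
#include <stddef.h>

/* Hypothetical summary of one parsed instruction: bit i is set if the
   instruction reads (or writes) inline assembly operand %i. */
struct InstEffect {
  unsigned reads;
  unsigned writes;
};

/* Returns the index of the first instruction which reads an input operand
   after an output operand without early-clobber has been written, or -1 if
   there is no such instruction. The masks use the same bit-per-operand
   encoding as InstEffect. */
int find_late_input_read(const struct InstEffect *insts, size_t n,
                         unsigned inputs, unsigned outputs,
                         unsigned early_clobber) {
  int output_written = 0;
  for (size_t i = 0; i < n; ++i) {
    /* Within one instruction, reads are treated as happening before
       writes, so check the reads first. */
    if (output_written && (insts[i].reads & inputs))
      return (int)i;
    if (insts[i].writes & outputs & ~early_clobber)
      output_written = 1;
  }
  return -1;
}
```

For the add3 example, the first add reads %1 and %2 and writes %0, and the
second reads %0 and %3 and writes %0, so the second instruction is flagged
unless %0 is marked early-clobber, whatever registers end up being allocated.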

The way this works is:
- When the AsmPrinter reaches an INLINEASM MachineInstr, it creates an
  InlineAsmDataflowChecker object, which tracks all of the information we need
  to know about one inline assembly statement.
- The AsmPrinter examines the operands to the MachineInstr, and calls
  functions on the tracking object to record information about each operand to
  the inline assembly block, as provided by the user and relied on by the
  compiler's optimisations and code-generation.
- While the AsmPrinter is generating the final assembly string (which involves
  expanding operand templates like "$0" into physical register names), it
  records the offset from the start of the (output) string at which each
  operand expansion appeared.
- The table-generated assembly matcher is modified to record the index of the
  MCParsedAsmOperand which resulted in the creation of each MCOperand. An
  MCParsedAsmOperand can create multiple MCOperands (for example, a memory
  operand with base and offset), but not the other way round, so this
  information is stored in the MCOperand.
- When the AsmParser is running and a tracking object is present (it is only
  present if we are parsing inline assembly), it records in each
  MCParsedAsmOperand the indexes of the inline assembly statement operands
  which overlap with it. This is done using the source location of the
  just-parsed MCParsedAsmOperand and the string offsets recorded in the
  tracking object by the AsmPrinter.
- Finally, after an instruction has been completely parsed, the AsmParser calls
  into the tracking object with the final MCInst and list of
  MCParsedAsmOperands. With all of this information, we can match up inline
  assembly operands to the MCOperands that were created for them, and check
  that they match.
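
The offset-matching step above can be sketched roughly as follows. This is
only an illustration under my own assumptions: the Expansion struct and the
function name are hypothetical, ranges are half-open byte offsets into the
expanded assembly string, and the result is returned as a bitmask rather than
stored on a parsed operand as the patch does.

```c
#include <stddef.h>

/* Hypothetical record made while expanding the template: inline assembly
   operand `index` was expanded into bytes [begin, end) of the output. */
struct Expansion {
  unsigned index;
  size_t begin;
  size_t end;
};

/* Returns a bitmask of the inline assembly operand indexes whose expansions
   overlap the parsed operand occupying bytes [begin, end). */
unsigned overlapping_operands(const struct Expansion *exps, size_t n,
                              size_t begin, size_t end) {
  unsigned mask = 0;
  for (size_t i = 0; i < n; ++i) {
    /* Two half-open ranges overlap if each starts before the other ends. */
    if (exps[i].begin < end && begin < exps[i].end)
      mask |= 1u << exps[i].index;
  }
  return mask;
}
```

A parsed operand may overlap more than one expansion (for example a memory
operand written as "[$1, $2]"), which is why a set of indexes is returned
rather than a single one.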

The reason that I'm posting this as an RFC is that it adds a lot of coupling
between different parts of the backend. Do people think this is an acceptable
cost for making a quite user-hostile feature a bit safer? If people agree that
this is worthwhile, there are some things I still need to do before the patch
is ready for a proper review:

- Check the memory overhead (during regular compilation) of storing the parsed
  operand number in the MCOperand. If this is too high, it could be moved to a
  separate data structure only created when parsing inline assembly.
- There are three different concepts of "operand" here (inline assembly
  operand, MCParsedAsmOperand and MCOperand). The naming needs tidying up so
  that these are easier to tell apart.
- For checks which depend on the order of instructions (for example, checking
  that a clobbered register is only read from if it has previously been written
  to), these could give false positives if there is control-flow in the
  assembly. We could check if an MCInst affects control flow, buffer these
  diagnostics until the end of the assembly block, and only emit them if there
  are no control-flow instructions.
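
A rough sketch of the buffering idea in the last item, to make the proposal
concrete. None of this exists in the patch yet; the type and function names
are hypothetical, and the fixed-size buffer is just to keep the example short.

```c
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical per-statement buffer for order-dependent diagnostics. */
struct DiagBuffer {
  const char *pending[MAX_PENDING];
  size_t count;
  int saw_control_flow;
};

void diag_init(struct DiagBuffer *b) {
  b->count = 0;
  b->saw_control_flow = 0;
}

/* Queue an order-dependent diagnostic instead of emitting it immediately.
   Messages beyond MAX_PENDING are dropped in this sketch. */
void diag_defer(struct DiagBuffer *b, const char *msg) {
  if (b->count < MAX_PENDING)
    b->pending[b->count++] = msg;
}

/* Called when a parsed instruction may affect control flow. */
void diag_note_control_flow(struct DiagBuffer *b) {
  b->saw_control_flow = 1;
}

/* Called at the end of the assembly block. Copies the queued diagnostics
   into `out` and returns how many should be emitted; this is zero if any
   control-flow instruction was seen. Resets the buffer either way. */
size_t diag_flush(struct DiagBuffer *b, const char **out) {
  size_t n = 0;
  if (!b->saw_control_flow) {
    for (; n < b->count; ++n)
      out[n] = b->pending[n];
  }
  diag_init(b);
  return n;
}
```

This trades some missed diagnostics in blocks containing branches for
avoiding false positives; checks which do not depend on instruction order
would not need to go through the buffer.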