[PATCH] D14256: [AsmParser] Backends can parameterize ASM tokenization.

Thu Feb 11 13:50:12 PST 2016

colinl added a comment.

The motivation for this change is that TableGen and the AsmParser recognize different token sets, which caused them to tokenize in fundamentally different ways.  Before this change, if TableGen was presented with "if(p0) r0 = r0" it would tokenize it as { "if(p0)", "r0", "=", "r0" } and the AsmParser would tokenize it as { "if", "(", "p0", ")", "r0", "=", "r0" }.  Before this patch, the set of tokens that are part of identifiers was fixed by the tokenizeAsmString switch table shared by all targets.  The first inclination is to add the parenthesis tokens to the switch table, but this causes issues in other targets; for instance, in X86 the ST(0) register has a hard time being parsed with that change.  Different targets therefore have mutually exclusive tokenization rules, hence the allowance for targets to specify their own tokens.

The existing tokenizeAsmString did three different kinds of tokenization.  It discarded certain separator characters: "a,b" -> { "a", "b" }.  It emitted certain characters as tokens of their own: "a$b" -> { "a", "$", "b" }.  And it did a weird one-off break on the dot character, but only sometimes, concatenating the dot to the following identifier: "a.b" -> { "a", ".b" }.  The change allowed each of these to be parameterized by each target.  We were already kind of doing this for the dot break case with "MnemonicContainsDot", which this patch replaced, and it needed to be done anyway for the mutually exclusive parsing rules.
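
To make the three behaviors concrete, here is a minimal standalone sketch of a tokenizer parameterized by three character classes.  It is not the code from the patch: the TokenizerParams struct, its field names and the tokenize function are hypothetical names chosen for illustration, and the real tokenizeAsmString handles more cases than this.

```cpp
#include <string>
#include <vector>

// Hypothetical per-target character classes (illustrative names only,
// not taken from the patch).
struct TokenizerParams {
  std::string Separators; // discarded; they only end the current token
  std::string Tokenizing; // each one is emitted as a single-character token
  std::string Breaks;     // start a new token and stay attached to what follows
};

// Split S into tokens according to the character classes in P.
std::vector<std::string> tokenize(const std::string &S,
                                  const TokenizerParams &P) {
  std::vector<std::string> Tokens;
  std::string Cur;

  // Emit the token accumulated so far, if there is one.
  auto Flush = [&] {
    if (!Cur.empty()) {
      Tokens.push_back(Cur);
      Cur.clear();
    }
  };

  for (char C : S) {
    if (P.Separators.find(C) != std::string::npos) {
      Flush();                             // "a,b" -> { "a", "b" }
    } else if (P.Tokenizing.find(C) != std::string::npos) {
      Flush();                             // "a$b" -> { "a", "$", "b" }
      Tokens.push_back(std::string(1, C));
    } else if (P.Breaks.find(C) != std::string::npos) {
      Flush();                             // "a.b" -> { "a", ".b" }
      Cur.push_back(C);
    } else {
      Cur.push_back(C);                    // ordinary identifier character
    }
  }
  Flush();
  return Tokens;
}
```

With this shape, a target that wants "if(p0)" split apart can list the parentheses as tokenizing characters, while a target that needs something like "st(0)" kept as one token can leave them out, without either choice affecting the other.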

Merge mistakes aside, I haven't heard a proposal for larger design changes to the assembly parser, or a volunteer to make them.

Repository:
  rL LLVM

http://reviews.llvm.org/D14256