[LLVMdev] Increase the flexibility of the AsmLexer in parsing identifiers.

Marcello Maggioni hayarms at gmail.com
Wed Nov 12 09:23:25 PST 2014


Hello,

I would like to gather some ideas and opinions on how to make the default
AsmLexer more flexible when dealing with identifiers.

When the lexer emits something as an "Identifier" (read: a string of
characters), the parser has to consume it in a single go, even if it
contains elements that the target would prefer to parse as separate
entities.
In that case the target has to implement custom logic that re-lexes and
parses the identifier string in place in order to emit the operands into
the operand vector, which is not ideal.

At the moment the default AsmLexer lexes tokens like this:

A number of symbols are lexed directly into tokens (like #, %, etc.), then
there are integer/float literals, and finally a fairly big default category
for anything that doesn't match the previous cases, which is handled by the
LexIdentifier() function.

In fact, in the current default AsmLexer this function doesn't always emit
an Identifier token; in some special cases it returns Float literals or Dot
tokens, so it works more like a "handle what I couldn't directly recognize"
kind of function.

On multiple occasions I have wanted to change what this function treats as
an Identifier and what it treats as separate tokens.

A use case would be this.

Let's say that my target's assembly syntax has this fancy characteristic
where different operands are separated by '$' (dollar) like in:

add r0$5$r3

The default AsmLexer would lex the entire r0$5$r3 as a single "Identifier",
so it is not possible to lex each operand separately. Instead, some custom
lexing logic must be applied to the returned "Identifier" token to split it
and recognize each of the operands.
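
To make the workaround concrete, here is a minimal standalone sketch of the
kind of splitting a target parser ends up doing on the identifier text. The
helper name splitOperands is hypothetical and is not part of LLVM; a real
target would additionally have to classify each piece (register, immediate,
...) and push the corresponding operands.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not an LLVM API): splits the text of one
// "Identifier" token into operand strings at every occurrence of Sep.
std::vector<std::string> splitOperands(std::string_view Ident,
                                       char Sep = '$') {
  std::vector<std::string> Parts;
  std::size_t Start = 0;
  while (true) {
    std::size_t Pos = Ident.find(Sep, Start);
    if (Pos == std::string_view::npos) {
      // No further separator: the remainder is the last operand.
      Parts.emplace_back(Ident.substr(Start));
      break;
    }
    Parts.emplace_back(Ident.substr(Start, Pos - Start));
    Start = Pos + 1; // Continue just past the separator.
  }
  return Parts;
}
```

Calling splitOperands("r0$5$r3") yields the three strings "r0", "5" and
"r3", which the target then has to lex and parse again by hand.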

This is a contrived example, but there are other cases where something
similar happens and can be a hassle to deal with, because what counts as an
Identifier depends entirely on some arbitrary logic in the lexer.

To override this logic, the entire default lexer and parser need to be
overridden (probably copying most of the existing logic for the rest of the
parsing anyway).

I would like to find an easier way to specify what is returned as an
Identifier and what is returned as separate tokens, allowing for more
flexibility.
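
As a rough illustration of the idea, and not of the attached patch, the toy
lexer below takes the "is this an identifier character?" decision as a
parameter. Everything here (TokKind, Token, IdentCharPred, defaultIdentChar,
lexLine) is a hypothetical name made up for the sketch; the default
predicate only approximates a permissive lexer in which '$' is allowed
inside identifiers.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Toy token kinds; they only loosely mirror the categories named above.
enum class TokKind { Identifier, Integer, Punct };

struct Token {
  TokKind Kind;
  std::string Text;
};

// Hypothetical hook: decides whether C may appear inside an identifier.
using IdentCharPred = std::function<bool(char)>;

// Assumed permissive default in which '$' is an identifier character.
inline bool defaultIdentChar(char C) {
  return std::isalnum(static_cast<unsigned char>(C)) || C == '_' ||
         C == '$' || C == '.';
}

std::vector<Token> lexLine(std::string_view Src,
                           const IdentCharPred &IsIdentChar =
                               defaultIdentChar) {
  std::vector<Token> Toks;
  std::size_t I = 0;
  while (I < Src.size()) {
    unsigned char C = static_cast<unsigned char>(Src[I]);
    if (std::isspace(C)) { // Skip whitespace between tokens.
      ++I;
      continue;
    }
    std::size_t Begin = I;
    TokKind Kind;
    if (std::isdigit(C)) {
      // Integer literal: a run of decimal digits.
      while (I < Src.size() &&
             std::isdigit(static_cast<unsigned char>(Src[I])))
        ++I;
      Kind = TokKind::Integer;
    } else if (IsIdentChar(Src[I])) {
      // Identifier: extends for as long as the predicate accepts.
      while (I < Src.size() && IsIdentChar(Src[I]))
        ++I;
      Kind = TokKind::Identifier;
    } else {
      // Anything else becomes a single-character punctuation token.
      ++I;
      Kind = TokKind::Punct;
    }
    Toks.push_back({Kind, std::string(Src.substr(Begin, I - Begin))});
  }
  return Toks;
}
```

With the default predicate, "add r0$5$r3" lexes as two identifiers, "add"
and "r0$5$r3". Passing a predicate that rejects '$' makes the same line lex
as add, r0, $, 5, $, r3, so each operand arrives as its own token and no
re-lexing of the identifier string is needed.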

I developed a tentative patch that adds this flexibility to the current
MCAsmLexer infrastructure.
I would like to gather opinions on this approach or ideas on other possible
approaches to achieve something similar and find out if somebody else finds
this kind of concept useful or not.

Thanks,
Marcello
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configurable_asmlexer.patch
Type: application/octet-stream
Size: 5351 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141112/d53b4ebb/attachment.obj>