[llvm-dev] Linking Linux kernel with LLD

Fri Jan 27 01:26:05 PST 2017

On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
wrote:

> >Our tokenizer recognizes
> >
> >  [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
> >
> >as a token. gold uses more complex rules to tokenize. I don't think we
> >need rules that complex, but there seems to be room to improve our
> >tokenizer. In particular, I believe we can parse the Linux kernel's linker
> >script by changing the tokenizer rules as follows.
> >
> >  [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
> >
> >or
> >
> >  [0-9]+
>
> After more investigation, it seems this will not work so simply.
> Here are some examples where it breaks:
> . = 0x1000; (gives tokens "0, x1000")
> . = A*10;   (gives "A*10")
> . = 10k;    (gives "10, k")
> . = 10*5;   (gives "10, *5")
>
> "[0-9]+" could be "[0-9][kmhKMHx0-9]*",
> but for "10*5" that still gives the tokens "10" and "*5".
> And I do not think we can add special handling of operators here,
> as it's hard to assume any context at the tokenizing step:
> we do not know whether we are parsing a file name or a math expression.
>
> Maybe it is worth trying to handle this at a higher level, during the
> evaluation of expressions?
>

The lexical format of linker scripts requires a context-sensitive lexer.
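
To see why a single token rule is not enough, the four examples quoted
above can be reproduced with a small script. This is only a sketch in
Python, not LLD's C++ code: `PUNCT`, `TOKEN` and `tokenize` are names made
up for illustration, the character class is my transcription of the
proposed rule, and treating any other character (such as `;`) as a
one-character token is an assumption.

```python
import re

# Punctuation the proposed rule allows inside a token (besides letters/digits).
PUNCT = r"_.$/\\~=+\[\]*?\-:!<>"
START = "[A-Za-z" + PUNCT + "]"    # first character: no digits allowed
CONT = "[A-Za-z0-9" + PUNCT + "]"  # later characters: digits allowed
# Proposed rule: either START CONT*, or a plain run of digits.
TOKEN = re.compile(START + CONT + "*|[0-9]+")

def tokenize(text):
    """Split text with one context-free rule, ignoring whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Assumption: anything else (e.g. ';') is a one-character token.
            tokens.append(text[pos])
            pos += 1
    return tokens

for line in [". = 0x1000;", ". = A*10;", ". = 10k;", ". = 10*5;"]:
    print(line, "->", tokenize(line))
```

Whatever the rule is, `*` either belongs to a token (wanted for a file
pattern like `*.o`) or it does not (wanted for `10*5`), and the tokenizer
cannot tell which without knowing what the parser expects next.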

Look at how gold does it. IIRC there are 3 cases, something like: one for
file-name-like things, one for numbers and stuff, and a last one for
numbers and stuff where numbers can also include things like `10k` (I
think; I would need to look at the code to remember for sure). gold does
this in a very elegant way: it passes a "can continue" callback that says
which characters can continue the current token. Which token regex to use
depends on the grammar production being parsed (hence context sensitive).
If you look at the other message I sent in this thread just now,
ScriptParserBase is essentially a lexer interface and can be converted
pretty easily to a more standard on-the-fly character-scanning lexer. Once
that is done, adding a new method that scans a different kind of token for
certain parts of the parser should be straightforward.
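
A rough sketch of the "can continue" idea, in Python for brevity. It
follows the description above, not gold's actual code: the `Lexer` class,
the two callbacks and their character sets are all hypothetical, and
accepting any letter or digit after a leading digit is a simplification
that happens to cover `0x1000` and `10k`.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
# Assumed character sets for the two modes (illustrative only).
NAME_CHARS = ALNUM | set("_.$/\\~=+[]*?-:!<>")  # file-name-like tokens
EXPR_CHARS = ALNUM | set("_.$")                 # symbols in expressions

def can_continue_name(first, ch):
    """File-name mode: operators such as '*' and '-' are part of the token."""
    return ch in NAME_CHARS

def can_continue_expr(first, ch):
    """Expression mode: operators end the token; numbers may carry suffixes."""
    if first.isdigit():
        return ch in ALNUM  # simplification: accepts 0x1000, 10k, ...
    return ch in EXPR_CHARS

class Lexer:
    """Scans characters on demand; the parser picks the rule per token."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next(self, can_continue):
        text = self.text
        while self.pos < len(text) and text[self.pos].isspace():
            self.pos += 1
        if self.pos == len(text):
            return None  # end of input
        start = self.pos
        first = text[start]
        self.pos += 1  # always consume one character
        # Extend the token only if its first character belongs to this mode;
        # otherwise it stays a one-character token such as '*' or ';'.
        if can_continue(first, first):
            while self.pos < len(text) and can_continue(first, text[self.pos]):
                self.pos += 1
        return text[start:self.pos]

def lex_all(text, can_continue):
    """Helper for experiments: lex a whole string in a single mode."""
    lexer = Lexer(text)
    tokens = []
    tok = lexer.next(can_continue)
    while tok is not None:
        tokens.append(tok)
        tok = lexer.next(can_continue)
    return tokens

print(lex_all(". = 10*5;", can_continue_expr))
print(lex_all("*(.text.*)", can_continue_name))
```

In a real parser the mode would change from token to token: the production
that expects a file name would call `next(can_continue_name)`, and the one
that parses an expression would call `next(can_continue_expr)` on the same
lexer.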

-- Sean Silva

>
> George.
>