[llvm-dev] Linking Linux kernel with LLD

Fri Jan 27 01:08:23 PST 2017

On Thu, Jan 26, 2017 at 7:56 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <ruiu at google.com> wrote:
>
>> Well, maybe, we should just change the Linux kernel instead of tweaking
>> our tokenizer too hard.
>>
>
> This is silly. Writing a simple and maintainable lexer is not hard (look
> e.g. at https://reviews.llvm.org/D10817). There are some complicated
> context-sensitive cases in linker scripts that break our approach of
> tokenizing up front (so we might want to hold off on those), but we aren't going
> to die from implementing enough to lex basic arithmetic expressions
> independent of whitespace.
>

Hmm, the crux of why we can't lex arithmetic expressions seems to be the
lack of context sensitivity. E.g. consider `foo*bar`: it could be a
multiplication, or it could be a glob pattern.

Looking at the code more closely, adding context sensitivity wouldn't be
that hard. In fact, our ScriptParserBase class is actually a lexer (look at
the interface; it is a lexer's interface). It shouldn't be hard to change
from an up-front tokenization to a more normal lexer approach of scanning
the text for each call that wants the next token. Roughly speaking, just
take the body of the for loop inside ScriptParserBase::tokenize and add a
helper which does that on the fly and is called by consume/next/etc.
Instead of an index into a token vector, just keep a `const char *` pointer
that we advance.

Once that is done, we can easily add a `nextArithmeticToken` or something
like that which just lexes with different rules.
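
To make that concrete, here is a minimal sketch of the idea. It is not the
actual LLD code: the `Lexer` class, its method names, and the exact character
sets are assumptions I made up for illustration. It keeps a `const char *`
cursor instead of a token vector, and the caller picks the lexing rules per
call.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical on-demand lexer; names and character sets are illustrative
// assumptions, not LLD's real ScriptParserBase interface.
class Lexer {
public:
  explicit Lexer(const char *Input) : Pos(Input) {
    assert(Input && "input must not be null");
  }

  // Default rules: a run of "word" characters is one token, so `foo*bar`
  // stays whole and can later be treated as a glob pattern.
  std::string next() { return lex(/*Arithmetic=*/false); }

  // Arithmetic rules: operators are single-character tokens, so `foo*bar`
  // becomes `foo`, `*`, `bar` independent of whitespace.
  std::string nextArithmeticToken() { return lex(/*Arithmetic=*/true); }

  bool atEOF() {
    skipSpace();
    return *Pos == '\0';
  }

private:
  const char *Pos; // Cursor into the text; replaces the token-vector index.

  void skipSpace() {
    while (std::isspace(static_cast<unsigned char>(*Pos)))
      ++Pos;
  }

  static bool isWordChar(char C, bool Arithmetic) {
    if (std::isalnum(static_cast<unsigned char>(C)))
      return true;
    // strchr also matches the terminating NUL, so callers check '\0' first.
    const char *Extra = Arithmetic ? "_.$" : "_.$/\\~=+[]*?-:!<>";
    return std::strchr(Extra, C) != nullptr;
  }

  // Scans one token starting at the cursor; returns "" at end of input.
  std::string lex(bool Arithmetic) {
    skipSpace();
    if (*Pos == '\0')
      return "";
    const char *Start = Pos;
    while (*Pos != '\0' && isWordChar(*Pos, Arithmetic))
      ++Pos;
    if (Pos == Start)
      ++Pos; // Not a word character: emit it as a one-character token.
    return std::string(Start, Pos);
  }
};
```

The parser would call `next()` where it expects a file name or pattern and
`nextArithmeticToken()` where it expects an expression, so the same text
`10*5` can lex as one token in the first case and three in the second.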

Implementing a linker is much harder than implementing a lexer. If we give
our users the impression that implementing a compatible lexer is hard for
us, what impression will we give them about the linker's implementation
quality? If we can afford 100 lines of self-contained code to implement a
concurrent hash table, we can afford 100 self-contained lines to implement
a context-sensitive lexer. This is end-user visible functionality; we
should be careful about skimping on it in the name of simplicity.

-- Sean Silva

>
> We will be laughed at. ("You seriously couldn't even be bothered to
> implement a real lexer?")
>
> -- Sean Silva
>
>
>>
>>
>> On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com>
>> wrote:
>>
>>> >Our tokenizer recognizes
>>> >
>>> >  [A-Za-z0-9_.$/\\~=+[]*?\-:!<>]+
>>> >
>>> >as a token. gold uses more complex rules to tokenize. I don't think we
>>> >need such complex rules, but there seems to be room to improve our
>>> >tokenizer. In particular, I believe we can parse Linux's linker script
>>> >by changing the tokenizer rules as follows.
>>> >
>>> >  [A-Za-z_.$/\\~=+[]*?\-:!<>][A-Za-z0-9_.$/\\~=+[]*?\-:!<>]*
>>> >
>>> >or
>>> >
>>> >  [0-9]+
>>>
>>> After more investigation, it seems that this will not work so simply.
>>> Here are some examples where it breaks:
>>> . = 0x1000; (gives tokens "0", "x1000")
>>> . = A*10;   (gives "A*10")
>>> . = 10k;    (gives "10", "k")
>>> . = 10*5;   (gives "10", "*5")
>>>
>>> "[0-9]+" could be changed to "[0-9][kmhKMHx0-9]*",
>>> but "10*5" would still give the tokens "10" and "*5".
>>> And I do not think we can add operator handling there,
>>> as it is hard to assume any context at the tokenizing step:
>>> we do not know whether we are parsing a file name or a math expression.
>>>
>>> Maybe it is worth trying to handle this at a higher level, during
>>> evaluation of expressions?
>>>
>>> George.
>>>
>>
>>
>