[llvm-dev] Linking Linux kernel with LLD
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Fri Jan 27 01:08:23 PST 2017
On Thu, Jan 26, 2017 at 7:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
> On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <ruiu at google.com> wrote:
>> Well, maybe, we should just change the Linux kernel instead of tweaking
>> our tokenizer too hard.
> This is silly. Writing a simple and maintainable lexer is not hard (look
> e.g. at https://reviews.llvm.org/D10817). There are some complicated
> context-sensitive cases in linker scripts that break our approach of
> tokenizing up front (so we might want to hold off on those), but we aren't
> going to die from implementing enough to lex basic arithmetic expressions
> independently of whitespace.
Hmm..., the crux of not being able to lex arithmetic expressions seems to
be the lack of context sensitivity. E.g., consider `foo*bar`: it could be a
multiplication, or it could be a glob pattern.
Looking at the code more closely, adding context sensitivity wouldn't be
that hard. In fact, our ScriptParserBase class is actually a lexer (look at
the interface; it is a lexer's interface). It shouldn't be hard to change
from an up-front tokenization to a more normal lexer approach of scanning
the text for each call that wants the next token. Roughly speaking, just
take the body of the for loop inside ScriptParserBase::tokenize and add a
helper which does that on the fly and is called by consume/next/etc.
Instead of an index into a token vector, just keep a `const char *` pointer
that we advance.
Once that is done, we can easily add a `nextArithmeticToken` or something
like that which just lexes with different rules.
Implementing a linker is much harder than implementing a lexer. If we give
our users the impression that implementing a compatible lexer is hard for
us, what impression will we give them about the linker's implementation
quality? If we can afford 100 lines of self-contained code to implement a
concurrent hash table, we can afford 100 self-contained lines to implement
a context-sensitive lexer. This is end-user-visible functionality; we
should be careful about skimping on it in the name of simplicity.
-- Sean Silva
> We will be laughed at. ("You seriously couldn't even be bothered to
> implement a real lexer?")
> -- Sean Silva
>> On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <grimar at accesssoftek.com> wrote:
>>> > Our tokenizer recognizes
>>> >   [A-Za-z0-9_.$/\\~=+*?\-:!<>]+
>>> > as a token. gold uses more complex rules to tokenize. I don't think we
>>> > need rules that complex, but there seems to be room to improve our
>>> > tokenizer. In particular, I believe we can parse the Linux linker
>>> > script by changing the tokenizer rules as follows.
>>> >   [A-Za-z_.$/\\~=+*?\-:!<>][A-Za-z0-9_.$/\\~=+*?\-:!<>]*
>>> >   [0-9]+
>>> After more investigation, it seems that will not work so simply.
>>> Here are examples where it breaks:
>>> . = 0x1000; (gives tokens "0", "x1000")
>>> . = A*10; (gives "A*10")
>>> . = 10k; (gives "10", "k")
>>> . = 10*5; (gives "10", "*5")
>>> "[0-9]+" could be "[0-9][kmhKMHx0-9]*",
>>> but for "10*5" that still gives the tokens "10" and "*5".
>>> And I do not think we can add handling of operators,
>>> as it is hard to assume any context at the tokenizing step.
>>> We do not know whether we are parsing a file name or a math expression.
>>> Maybe it is worth trying to handle this at a higher level, during
>>> evaluation of expressions?