<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jan 26, 2017 at 7:56 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="gmail-">On Tue, Jan 24, 2017 at 11:29 AM, Rui Ueyama <span dir="ltr"><<a href="mailto:ruiu@google.com" target="_blank">ruiu@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Well, maybe, we should just change the Linux kernel instead of tweaking our tokenizer too hard.</div></blockquote><div><br></div></span><div>This is silly. Writing a simple and maintainable lexer is not hard (look e.g. at <a href="https://reviews.llvm.org/D10817" target="_blank">https://reviews.llvm.org/<wbr>D10817</a>). There are some complicated context-sensitive cases in linker scripts that break our approach of tokenizing up front (so we might want to hold off on those), but we aren't going to die from implementing enough to lex basic arithmetic expressions independent of whitespace.</div></div></div></div></blockquote><div><br></div><div>Hmm..., the crux of not being able to lex arithmetic expressions seems to be the lack of context sensitivity. E.g. consider `foo*bar`. It could be a multiplication, or it could be a glob pattern.</div><div><br></div><div>Looking at the code more closely, adding context sensitivity wouldn't be that hard. In fact, our ScriptParserBase class is actually a lexer (look at the interface; it is a lexer's interface). It shouldn't be hard to change from an up-front tokenization to a more normal lexer approach of scanning the text for each call that wants the next token. 
Roughly speaking, just take the body of the for loop inside ScriptParserBase::tokenize and add a helper which does that on the fly and is called by consume/next/etc. Instead of an index into a token vector, just keep a `const char *` pointer that we advance.</div><div><br></div><div>Once that is done, we can easily add a `nextArithmeticToken` or something like that which just lexes with different rules.</div><div><br></div><div><br></div><div>Implementing a linker is much harder than implementing a lexer. If we give our users the impression that implementing a compatible lexer is hard for us, what impression will we give them about the linker's implementation quality? If we can afford 100 lines of self-contained code to implement a concurrent hash table, we can afford 100 self-contained lines to implement a context-sensitive lexer. This is end-user visible functionality; we should be careful about skimping on it in the name of simplicity.</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>We will be laughed at. ("You seriously couldn't even be bothered to implement a real lexer?")</div><span class="gmail-HOEnZb"><font color="#888888"><div><br></div><div>-- Sean Silva</div></font></span><span class="gmail-"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div class="gmail-m_-199790874614197034gmail-h5"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 24, 2017 at 7:57 AM, George Rimar <span dir="ltr"><<a href="mailto:grimar@accesssoftek.com" target="_blank">grimar@accesssoftek.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr" style="font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255);font-family:calibri,arial,helvetica,sans-serif">
<p><span style="color:rgb(33,33,33);font-size:12pt">>Our tokenizer recognizes</span></p>
<div dir="ltr" style="font-size:12pt;color:rgb(0,0,0);background-color:rgb(255,255,255)">
<div>
<div style="color:rgb(33,33,33)">
<div>
<div dir="ltr"><span>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">>
</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif">> <font face="monospace, monospace">[A-Za-z0-9_.$/\\~=+[]*?\-:!<><wbr>]+</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif">> </div>
<div style="font-family:calibri,arial,helvetica,sans-serif">>as a token. gold uses more complex rules to tokenize. I don't think we need rules that complex, but there seems to be room to improve our tokenizer. In particular, I believe we can parse
the Linux kernel's linker script by changing the tokenizer rules as follows.</div>
<div style="font-family:calibri,arial,helvetica,sans-serif">> </div>
</span><div><span>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">> [A-Za-z_.$/\\~=+[]*?\-:!<>][A<wbr>-Za-z0-9_.$/\\~=+[]*?\-:!<>]*</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif">> </div>
<div style="font-family:calibri,arial,helvetica,sans-serif">>or</div>
<div style="font-family:calibri,arial,helvetica,sans-serif">> </div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">> [0-9]+</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace"><br>
</font></div>
</span><div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">After more investigation, it seems this will not work so simply.</font></div>
<div><font face="monospace, monospace">Here are some examples where it breaks:</font></div>
<div><font face="monospace, monospace">. = 0x1000; (gives tokens "0" and "x1000")</font></div>
<div><font face="monospace, monospace">. = A*10; (gives the single token "A*10")</font></div>
<div><font face="monospace, monospace">. = 10k; (gives "10" and "k")</font></div>
<div><font face="monospace, monospace">. = 10*5; (gives "10" and "*5")</font></div>
<div><font face="monospace, monospace"><br>
</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">"[0-9]+" could be changed to "[0-9][kmhKMHx0-9]*",</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">but for "10*5" that still gives the tokens "10" and "*5".</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">And I do not think we can add operator handling there,</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">as <span style="color:rgb(33,33,33);font-family:monospace,monospace;font-size:16px;background-color:rgb(255,255,255)">it is hard to assume any context</span>
at the tokenizing step.</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">We do not know whether we are parsing a file name or a math expression.</font></div>
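<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">The breakage described above can be reproduced with a small Python sketch. It only approximates the rules quoted in this thread and is not lld's actual tokenizer: the names START, CONT, PROPOSED, EXTENDED and tokenize are made up for illustration, and the trailing \S alternative (any other non-space character becomes a one-character token) is an assumption so that ";" has somewhere to go.</font></div>

```python
import re

# Character classes transcribed from the rules quoted in this thread.
# START is the set allowed as the first character of a word token;
# CONT additionally allows digits in the following positions.
START = r"A-Za-z_.$/\\~=+\[\]*?\-:!<>"
CONT = START + "0-9"
WORD = f"[{START}][{CONT}]*"

# Proposed rule: WORD or [0-9]+.
# Assumption: anything else (e.g. ';') is a single-character token.
PROPOSED = re.compile(rf"{WORD}|[0-9]+|\S")

# Variant with the number rule widened to [0-9][kmhKMHx0-9]*.
EXTENDED = re.compile(rf"{WORD}|[0-9][kmhKMHx0-9]*|\S")


def tokenize(rule, text):
    """Split text into tokens, skipping whitespace between matches."""
    return rule.findall(text)


if __name__ == "__main__":
    for line in [". = 0x1000;", ". = A*10;", ". = 10k;", ". = 10*5;"]:
        print(line, "->", tokenize(PROPOSED, line), "|", tokenize(EXTENDED, line))
```

<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">With the widened number rule, "0x1000" and "10k" each stay in one piece, but "10*5" is still split into "10" and "*5", because "*" is a legal first character of a word token.</font></div>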
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace"><br>
</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">It may be worth trying to handle this at a higher level, during evaluation of</font></div>
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace">expressions?</font></div><span class="gmail-m_-199790874614197034gmail-m_5520829228765660322HOEnZb"><font color="#888888">
<div style="font-family:calibri,arial,helvetica,sans-serif"><font face="monospace, monospace"><br>
</font></div>
</font></span></div><span class="gmail-m_-199790874614197034gmail-m_5520829228765660322HOEnZb"><font color="#888888">
<div class="gmail_extra" style="font-family:calibri,arial,helvetica,sans-serif">
George.<br>
</div>
</font></span></div>
</div>
</div>
</div>
</div>
</div>
</blockquote></div><br></div></div></div></div>
</blockquote></span></div><br></div></div>
</blockquote></div><br></div></div>