<div class="gmail_quote">On Fri, Mar 30, 2012 at 14:55, Chris Lattner <span dir="ltr"><<a href="mailto:clattner@apple.com">clattner@apple.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div style="word-wrap:break-word"><div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">+/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by</div>


<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">+/// Annex E of the C++ standard (note: pretty sure this is different in C).</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">


+static inline bool isIdentifierBody(UTF32 c) {</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">+  if (c <= 127)</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">


+    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true : false;</div><div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">+  else if (c & 0xffff0000)</div>


<div style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0px;font:normal normal normal 11px/normal Menlo">+    return ~c & 0xfffd;</div></div><div><br></div><div>This should be split into two functions: isIdentifierBody(), which handles just the <= 127 case, and an out-of-line __attribute__((noinline)) function that handles the slow path.</div>
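<div><br></div><div>A minimal sketch of that split, for illustration only. The names isAsciiIdentifierBody and isIdentifierBodySlow are hypothetical, the ASCII check stands in for the patch's CharInfo lookup, and the slow path covers only a small Latin-1 subset rather than the full Annex E list. __builtin_expect and __attribute__((noinline)) are the GCC/Clang spellings.</div>

```cpp
#include <cstdint>

// Hypothetical stand-in for the patch's CharInfo table lookup:
// ASCII letters, digits, and underscore.
static inline bool isAsciiIdentifierBody(uint32_t c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_';
}

// Slow path, kept out of line so it cannot block inlining of the fast path.
// Placeholder check: only U+00C0..U+00FF minus U+00D7 and U+00F7,
// NOT the complete Annex E table.
__attribute__((noinline))
static bool isIdentifierBodySlow(uint32_t c) {
  return c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7;
}

// Fast path: small enough to inline at every call site.
static inline bool isIdentifierBody(uint32_t c) {
  if (__builtin_expect(c <= 127, 1))  // hint: ASCII is the common case
    return isAsciiIdentifierBody(c);
  return isIdentifierBodySlow(c);
}
```

<div>With this shape the inlined body is just the ASCII test plus a call, so the non-ASCII handling can grow without affecting the hot loop.</div>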


<div><br></div><div>We don't want the slow path to prevent inlining of the fast path.  If we have a macro for builtin_expect, it would be worth using it on the "c <= 127" branch.</div><div><br></div><div>


I would also prefer the loop in <span style="font-family:Menlo;font-size:11px">Lexer::LexIdentifier to be written something like this, which might make the idea above irrelevant:</span></div><div><span style="font-family:Menlo;font-size:11px"><br>


</span></div><div><span style="font-family:Menlo;font-size:11px"><br></span></div><div><span style="font-family:Menlo;font-size:11px">do {</span></div><div><span style="font-family:Menlo;font-size:11px">  C = *CurPtr;</span></div>


<div><font face="Menlo"><span style="font-size:11px">  if (C > 127) goto UTFIdentifier;</span></font></div><div><font face="Menlo"><span style="font-size:11px">  ++CurPtr;</span></font></div><div><font face="Menlo"><span style="font-size:11px">} while (is</span></font><span style="font-family:Menlo;font-size:11px">NonUTF</span><span style="font-family:Menlo;font-size:11px">IdentifierBody(C))</span></div>


<div><span style="font-family:Menlo;font-size:11px"><br></span></div><div><font face="Menlo"><span style="font-size:11px">...</span></font></div><div><font face="Menlo"><span style="font-size:11px"><br></span></font></div>


<div><font face="Menlo"><span style="font-size:11px">UTFIdentifier:</span></font></div><div><font face="Menlo"><span style="font-size:11px">  ... handle the general case here.</span></font></div><span class="HOEnZb"><font color="#888888"><div>


<font face="Menlo"><span style="font-size:11px"><br></span></font></div><div><font face="Menlo"><span style="font-size:11px">-Chris</span></font></div></font></span></div></blockquote><div><br>If speed is /really/ important, I'd just go for an 8KB lookup table (65536 bits, enough for one bit per 16-bit code point). It's a one-time memory cost that is small next to the usual total memory use, and it has the advantage that UTF-8-heavy code, if it ever shows up, will lex quickly as long as the table stays in cache. But I can do as suggested as well.<br>
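<div><br>A rough sketch of what the 8KB table could look like, under some assumptions: the names IdentBits, markRange, initIdentTable and isIdentifierBodyTable are made up for illustration, the ranges filled in are only a small subset of Annex E, and code points beyond 0xFFFF are simply rejected here instead of being handled properly.</div>

```cpp
#include <cstdint>

// One bit per 16-bit code point: 65536 bits = 8192 bytes (8KB).
static uint8_t IdentBits[65536 / 8];

// Marks the inclusive range [lo, hi] as valid identifier characters.
// Callers must keep hi <= 0xFFFF.
static void markRange(uint32_t lo, uint32_t hi) {
  for (uint32_t c = lo; c <= hi; ++c)
    IdentBits[c >> 3] |= static_cast<uint8_t>(1u << (c & 7));
}

// Illustrative subset only; a real table would be filled from the
// complete list of ranges in Annex E.
static void initIdentTable() {
  markRange('0', '9');
  markRange('A', 'Z');
  markRange('a', 'z');
  markRange('_', '_');
  markRange(0x00C0, 0x00D6);
  markRange(0x00D8, 0x00F6);
  markRange(0x00F8, 0x00FF);
}

// Single lookup for ASCII and non-ASCII alike.
static inline bool isIdentifierBodyTable(uint32_t c) {
  if (c > 0xFFFF)
    return false;  // placeholder: supplementary planes need separate handling
  return (IdentBits[c >> 3] >> (c & 7)) & 1;
}
```

<div>The lookup is one shift, one load and one mask per character, so the ASCII and non-ASCII cases cost the same as long as the touched part of the table is in cache.</div>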


<br>Sean<br></div></div><br>