[cfe-dev] [REVIEW] UTF-8 in identifiers proof of concept
Chris Lattner
clattner at apple.com
Fri Mar 30 11:55:53 PDT 2012
On Mar 30, 2012, at 8:06 AM, Sean Hunt wrote:
> On Sat, Dec 31, 2011 at 22:11, Sean Hunt <scshunt at csclub.uwaterloo.ca> wrote:
> Hey folks,
>
> Attached is a proof of concept for the handling of UTF-8 in
> identifiers. Aside from the terrible isIdentifierBody function, which
> should be optimized where possible (possibly into a lookup table for
> the BMP, since that would be 8kb, and using the simple bitwise
> operation in there for other planes), I think the approach is the
> correct one. Given that this is sensitive code, however, I would like
> to ensure no one has any issues with this approach before I convert
> more of the lexer code over.
>
> Sean
>
> This patch still applies reasonably cleanly; any feedback?
+/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by
+/// Annex E of the C++ standard (note: pretty sure this is different in C).
+static inline bool isIdentifierBody(UTF32 c) {
+  if (c <= 127)
+    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true : false;
+  else if (c & 0xffff0000)
+    return ~c & 0xfffd;
This should be split into two functions: isIdentifierBody(), which handles just the <= 127 case, and an __attribute__((noinline)) function that handles everything else as the slow path.
We don't want the slow path to prevent inlining of the fast path. If we have a macro for __builtin_expect, it would be worth using it on the "c <= 127" branch.
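A minimal sketch of that split, under loudly stated assumptions: the macro names SKETCH_LIKELY and SKETCH_NOINLINE, the helper isASCIIIdentifierBody, and the function isIdentifierBodySlow are all hypothetical, the ASCII check stands in for the CharInfo table lookup, and the slow-path ranges are a placeholder rather than the Annex E ranges from the patch.

```cpp
#include <cassert>
#include <cstdint>

using UTF32 = std::uint32_t; // stand-in for the UTF32 typedef used in the patch

// Hypothetical wrappers; fall back to no-ops on compilers without the builtins.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_LIKELY(x) (x)
#define SKETCH_NOINLINE
#endif

// Assumption: plain range checks stand in for the CharInfo[] table lookup.
inline bool isASCIIIdentifierBody(UTF32 c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_';
}

// Slow path, kept out of line so it cannot block inlining of the fast path.
// Placeholder policy only: accepts U+00C0..U+00FF except the multiplication
// and division signs. These are NOT the real Annex E ranges.
SKETCH_NOINLINE bool isIdentifierBodySlow(UTF32 c) {
  return c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7;
}

// Fast path: small enough to inline at every call site in the lexer loop.
inline bool isIdentifierBody(UTF32 c) {
  if (SKETCH_LIKELY(c <= 127))
    return isASCIIIdentifierBody(c);
  return isIdentifierBodySlow(c);
}

// Small usage check for the sketch.
inline void checkIdentifierBodySketch() {
  assert(isIdentifierBody('x'));   // ASCII letter, fast path
  assert(!isIdentifierBody('+'));  // ASCII punctuation, fast path
  assert(isIdentifierBody(0xE9));  // U+00E9, slow path (placeholder range)
}
```

The point of the design is that the inlined body contains only a compare, a cheap ASCII test, and a call; whatever table or range logic the slow path ends up needing stays in one out-of-line copy.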
I would also prefer the loop in Lexer::LexIdentifier to be written something like this, which might make the idea above irrelevant:
  do {
    C = *CurPtr;
    if (C > 127) goto UTFIdentifier;
    ++CurPtr;
  } while (isNonUTFIdentifierBody(C));
  ...
UTFIdentifier:
  ... handle the general case here.
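For concreteness, here is a self-contained sketch of that loop shape. It is only an illustration: the IdentifierScan struct, the scanIdentifier function, and the body of isNonUTFIdentifierBody are hypothetical, the input is assumed to be NUL-terminated, and the general case is reduced to reporting that a non-ASCII byte was seen instead of decoding UTF-8.

```cpp
#include <cassert>

// Hypothetical result type for the sketch.
struct IdentifierScan {
  const char *End;  // first byte that was not consumed by the ASCII loop
  bool SawNonASCII; // true if the loop stopped at a byte > 127
};

// Assumption: range checks stand in for the lexer's character table.
inline bool isNonUTFIdentifierBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Scans a NUL-terminated buffer starting at CurPtr.
inline IdentifierScan scanIdentifier(const char *CurPtr) {
  unsigned char C;
  do {
    C = static_cast<unsigned char>(*CurPtr);
    if (C > 127)
      goto UTFIdentifier; // leave the hot loop before advancing
    ++CurPtr;
  } while (isNonUTFIdentifierBody(C));

  // The loop advanced past the first non-identifier byte; back up over it.
  --CurPtr;
  return {CurPtr, false};

UTFIdentifier:
  // General case would decode UTF-8 here; the sketch only reports it.
  return {CurPtr, true};
}
```

With this shape the common all-ASCII identifier never touches the UTF-8 path: the only extra cost in the hot loop is the single "C > 127" comparison.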
-Chris