[cfe-dev] [REVIEW] UTF-8 in identifiers proof of concept

Fri Mar 30 11:55:53 PDT 2012

On Mar 30, 2012, at 8:06 AM, Sean Hunt wrote:

> On Sat, Dec 31, 2011 at 22:11, Sean Hunt <scshunt at csclub.uwaterloo.ca> wrote:
> Hey folks,
> 
> Attached is a proof of concept for the handling of UTF-8 in
> identifiers. Aside from the terrible isIdentifierBody function, which
> should be optimized where possible (possibly into a lookup table for
> the BMP, since that would be 8kb, and using the simple bitwise
> operation in there for other planes), I think the approach is the
> correct one. Given that this is sensitive code, however, I would like
> to ensure no one has any issues with this approach before I convert
> more of the lexer code over.
> 
> Sean
> 
> This patch still applies reasonably cleanly; any feedback?

+/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by
+/// Annex E of the C++ standard (note: pretty sure this is different in C).
+static inline bool isIdentifierBody(UTF32 c) {
+  if (c <= 127)
+    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true : false;
+  else if (c & 0xffff0000)
+    return ~c & 0xfffd;

This should be split into two functions: isIdentifierBody(), which handles only the <= 127 case, and a separate __attribute__((noinline)) function that handles everything else as the slow path.

We don't want the slow path to prevent inlining of the fast path.  If we have a macro for __builtin_expect, it would be worth using it on the "c <= 127" branch.
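
A minimal sketch of that split, assuming a GCC/Clang-compatible compiler. The names LIKELY and isIdentifierBodySlow are hypothetical, the ASCII test is spelled out instead of using the CharInfo table, and the slow path checks only a small illustrative range rather than the full Annex E list:

```cpp
#include <cstdint>

// Hypothetical macro; assumes __builtin_expect is available (GCC/Clang).
#define LIKELY(x) __builtin_expect(!!(x), 1)

// Slow path, kept out of line so it cannot block inlining of the fast path.
// Placeholder logic only: accepts U+00C0..U+00FF except U+00D7 and U+00F7.
// A real implementation would consult the full Annex E ranges.
__attribute__((noinline))
bool isIdentifierBodySlow(uint32_t c) {
  return c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7;
}

// Fast path: small enough to inline at every call site.
static inline bool isIdentifierBody(uint32_t c) {
  if (LIKELY(c <= 127)) {
    // Stand-in for the CharInfo[] lookup in the patch.
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
  }
  return isIdentifierBodySlow(c);
}
```

The point of the shape is that the inlined body contains only the ASCII test plus one call, so the size of the non-ASCII handling does not affect the inliner's decision.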

I would also prefer the loop in Lexer::LexIdentifier to be written something like this, which might make the idea above irrelevant:

do {
  C = *CurPtr;
  if (C > 127) goto UTFIdentifier;
  ++CurPtr;
} while (isNonUTFIdentifierBody(C));

...

UTFIdentifier:
  ... handle the general case here.
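
For illustration, a self-contained version of that loop shape might look like the following. This is only a sketch: skipIdentifier is a hypothetical helper rather than the real Lexer::LexIdentifier, it assumes a NUL-terminated buffer, and the UTFIdentifier label simply stops at the first non-ASCII byte instead of decoding UTF-8.

```cpp
// Hypothetical ASCII-only predicate, standing in for the CharInfo lookup.
static inline bool isNonUTFIdentifierBody(unsigned char C) {
  return (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9') || C == '_';
}

// Returns a pointer to the first byte that is not part of the ASCII
// identifier starting at CurPtr. Assumes the buffer is NUL-terminated.
const char *skipIdentifier(const char *CurPtr) {
  unsigned char C;
  do {
    C = static_cast<unsigned char>(*CurPtr);
    if (C > 127) goto UTFIdentifier;  // leave the hot loop for rare input
    ++CurPtr;
  } while (isNonUTFIdentifierBody(C));
  // The loop advanced past the first non-identifier byte; step back onto it.
  return CurPtr - 1;

UTFIdentifier:
  // Placeholder for the general case: a real lexer would decode the UTF-8
  // sequence here and continue. This sketch just stops at the lead byte.
  return CurPtr;
}
```

With this arrangement the common all-ASCII identifier never touches the UTF-8 handling, and the check for C > 127 is the only extra work in the hot loop.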

-Chris
