[cfe-dev] [REVIEW] UTF-8 in identifiers proof of concept

Sean Hunt scshunt at csclub.uwaterloo.ca
Fri Mar 30 18:12:33 PDT 2012


On Fri, Mar 30, 2012 at 14:55, Chris Lattner <clattner at apple.com> wrote:
>
> +/// identifier, which is [a-zA-Z0-9_], or a Unicode character defined by
> +/// Annex E of the C++ standard (note: pretty sure this is different in
> C).
> +static inline bool isIdentifierBody(UTF32 c) {
> +  if (c <= 127)
> +    return (CharInfo[c] & (CHAR_LETTER|CHAR_NUMBER|CHAR_UNDER)) ? true :
> false;
> +  else if (c & 0xffff0000)
> +    return ~c & 0xfffd;
>
> This should be split into two functions: isIdentifierBody() which just
> handles the <= 127 case, and the rest in a attribute(no_inline) function to
> handle the slowpath.
>
> We don't want the slow path to prevent inlining of the fast path.  If we
> have a macro for builtin_expect, it would be worth using it on the "c <=
> 127" branch.
>
> I would also prefer the loop in Lexer::LexIdentifier to be written
> something like this, which might make the idea above irrelevant:
>
>
> do {
>   C = *CurPtr;
>   if (C > 127) goto UTFIdentifier;
>   ++CurPtr;
> } while (isNonUTFIdentifierBody(C))
>
> ...
>
> UTFIdentifier:
>   ... handle the general case here.
>
> -Chris
>

If speed is /really/ important, I'd just go for an 8KB lookup table, since
it's a one-time memory cost and it's not a lot compared to the usual total
memory cost, and then it has an advantage that UTF-8 heavy code, if it ever
happens, will run quickly if the table stays in cache. But I can do as
suggested as well.

Sean
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20120330/f821e977/attachment.html>


More information about the cfe-dev mailing list