[cfe-commits] [PATCH] Support for universal character names in identifiers

Mon Jan 14 17:33:47 PST 2013

On Mon, Jan 14, 2013 at 4:54 PM, Jordan Rose <jordan_rose at apple.com> wrote:

>
> On Jan 14, 2013, at 13:19 , Richard Smith <richard at metafoo.co.uk> wrote:
>
> As a general point, please keep in mind how we might support UTF-8 in
> source code when working on this. The C++ standard requires that our
> observable behavior is that we treat extended characters in the source code
> and UCNs identically (modulo raw string literals), so the more code we can
> share between the two, the better.
>
>
> Please see the attached patch for a start on implementing UTF-8 support.
> One notable difference between this and the UCN patch is that the character
> validation happens in the lexer, not when we come to look up an
> IdentifierInfo; this is necessary in order to support error recovery for
> UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids
> for UCNs which we never convert to identifiers.
>
>
> I was trying to avoid using a sentinel char value; one reason is my
> three-quarters-finished implementation of fixits for smart quotes. If we
> just assume that UTF-8 characters are rare, we can handle them in
> LexTokenInternal's 'default' case, and use a 'classifyUTF8()' helper rather
> than smashing the character input stream with placeholders.
>

I started off with an approach which handled them in the 'default' case,
but the current approach ended up much cleaner (in particular, it doesn't
require anything special in any identifier lexing code to cope with UTF-8
characters which aren't identifier characters). I don't follow the
significance of the smart quotes fixit, I'm afraid; my patch performs a
similar fix-up for Unicode whitespace characters, so I would expect that
fixit to be implementable either way.

My biggest concern with the getCharAndSize approach for UTF-8 is that
isObviouslySimpleCharacter is an extraordinarily hot function, and the
patch adds a branch to it. I've not measured the impact of that, but it
could kill that approach if the cost is too high. Maybe there's some clever
(branch-free?) way of detecting '?', '\', and values > 127 which doesn't
trigger for any common ASCII characters, though.

> The main difference between UCNs and literal UTF-8 is that (valid) literal
> UTF-8 will always appear literally in the source. But I guess it doesn't
> matter so much since the only place Unicode is valid is in identifiers and
> as whitespace, and neither of those will use the output of getCharAndSize.
> I do, however, want to delay the check for if a backslash starts a UCN to
> avoid Eli's evil recursion problem:
>
> char *\\
> \\\
> \\
> \\\
> \\
> \\\
> \u00FC;
>
> If UCNs are processed in getCharAndSize, you end up with several recursive
> calls asking if the first backslash starts a UCN.
>

It doesn't, of course, but if getCharAndSize calls isUCNAfterSlash you need
> to getCharAndSize all the way to the character after the final backslash to
> prove it.
>

Once you get to the third backslash and see it's not followed by
whitespace, you can stop, right? Is this just a performance concern for a
pathological case, or is there more to it than that?

> After all, this *is* a UCN, in C at least:
>
> char *\
> \
> u00FC;
>
> And once we're delaying the backslash, I'm not sure it makes sense to
> classify the Unicode until we hit LexTokenInternal. Once we get there,
> though, I can see it making sense to do it there rather than in identifier
> creation, and have a (mostly) unified Unicode path after that.
>

Yes, if we need to delay handling the backslash, that makes sense to me,
but it's likely to be quite an invasive change.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20130114/078aed6f/attachment.html>