<html><head><meta http-equiv="Content-Type" content="text/html charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Jan 14, 2013, at 13:19 , Richard Smith <<a href="mailto:richard@metafoo.co.uk">richard@metafoo.co.uk</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div class="gmail_quote">As a general point, please keep in mind how we might support UTF-8 in source code when working on this. The C++ standard requires that our observable behavior is that we treat extended characters in the source code and UCNs identically (modulo raw string literals), so the more code we can share between the two, the better.</div></blockquote><blockquote type="cite"><div class="gmail_quote">

<div><br></div><div>Please see the attached patch for a start on implementing UTF-8 support. One notable difference between this and the UCN patch is that the character validation happens in the lexer, not when we come to look up an IdentifierInfo; this is necessary in order to support error recovery for UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids for UCNs which we never convert to identifiers.</div>

</div></blockquote><br></div><div>I was trying to avoid using a sentinel char value; one reason is my three-quarters-finished implementation of fixits for smart quotes. If we just assume that UTF-8 characters are rare, we can handle them in LexTokenInternal's 'default' case, and use a 'classifyUTF8()' helper rather than smashing the character input stream with placeholders.</div><div><br></div><div>The main difference between UCNs and literal UTF-8 is that (valid) literal UTF-8 will always appear literally in the source. But I guess it doesn't matter so much, since the only places Unicode is valid are in identifiers and as whitespace, and neither of those will use the output of getCharAndSize. I do, however, want to delay the check for whether a backslash starts a UCN to avoid Eli's evil recursion problem:</div><div><br></div><div>char *\\</div><div>\\\</div><div>\\</div><div>\\\</div><div>\\</div><div>\\\</div><div>\u00FC;</div><div><br></div><div>If UCNs are processed in getCharAndSize, you end up with several recursive calls asking whether the first backslash starts a UCN. It doesn't, of course, but if getCharAndSize calls isUCNAfterSlash, you need to call getCharAndSize all the way to the character after the final backslash to prove it. After all, this <i>is</i> a UCN, in C at least:</div><div><br></div><div>char *\</div><div>\</div><div>u00FC;</div><div><br></div><div>And once we're delaying the backslash check, I'm not sure it makes sense to classify the Unicode until we hit LexTokenInternal. Once we get there, though, I can see it making sense to do it there rather than in identifier creation, and have a (mostly) unified Unicode path after that.</div><div><br></div><div>Jordan</div></body></html>