<html><head><meta http-equiv="Content-Type" content="text/html charset=iso-8859-1"><meta http-equiv="Content-Type" content="text/html charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Jan 14, 2013, at 17:33 , Richard Smith <<a href="mailto:richard@metafoo.co.uk">richard@metafoo.co.uk</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">On Mon, Jan 14, 2013 at 4:54 PM, Jordan Rose <span dir="ltr"><<a href="mailto:jordan_rose@apple.com" target="_blank">jordan_rose@apple.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word"><div class="im"><br><div><div>On Jan 14, 2013, at 13:19 , Richard Smith <<a href="mailto:richard@metafoo.co.uk" target="_blank">richard@metafoo.co.uk</a>> wrote:</div><br><blockquote type="cite">
<div class="gmail_quote">As a general point, please keep in mind how we might support UTF-8 in source code when working on this. The C++ standard requires that our observable behavior is that we treat extended characters in the source code and UCNs identically (modulo raw string literals), so the more code we can share between the two, the better.</div>
</blockquote><blockquote type="cite"><div class="gmail_quote">
<div><br></div><div>Please see the attached patch for a start on implementing UTF-8 support. One notable difference between this and the UCN patch is that the character validation happens in the lexer, not when we come to look up an IdentifierInfo; this is necessary in order to support error recovery for UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids for UCNs which we never convert to identifiers.</div>
</div></blockquote><br></div></div><div>I was trying to avoid using a sentinel char value; one reason is my three-quarters-finished implementation of fixits for smart quotes. If we just assume that UTF-8 characters are rare, we can handle them in LexTokenInternal's 'default' case, and use a 'classifyUTF8()' helper rather than smashing the character input stream with placeholders.</div>
</div></blockquote><div><br></div><div>I started off with an approach which handled them in the 'default' case, but the current approach ended up much cleaner (in particular, it doesn't require anything special in any identifier lexing code to cope with UTF-8 characters which aren't identifier characters). I don't follow the significance of the smart quotes fixit, I'm afraid; my patch performs a similar fix-up for Unicode whitespace characters, so I would expect that fixit to be implementable either way.</div>
<div></div></div></blockquote><div><br></div><div>I guess you'd just move the smart quotes identification to the "classify this Unicode" function, so okay. (OTOH, a UCN for smart quotes shouldn't really get a fixit for straight quotes, ever.)</div><div><br></div><br><blockquote type="cite"><div class="gmail_quote"><div>My biggest concern with the getCharAndSize approach for UTF-8 is that isObviouslySimpleCharacter is an extraordinarily hot function, and the patch adds a branch to it. I've not measured the impact of that, but it could kill that approach if the cost is too high. Maybe there's some clever (branch-free?) way of detecting '?', '\', and values > 127 which doesn't trigger for any common ASCII characters, though.</div>
<div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"></div></blockquote></div></blockquote><div><br></div><div>A larger jump table would do this, of course, but a memory lookup is probably worse than a branch. Still, that is an advantage of pushing UTF-8 handling back to LexTokenInternal: the hot path doesn't change, and a correct C/C++ program will<i> never</i> hit the new code.</div><div><br></div><br><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>The main difference between UCNs and literal UTF-8 is that (valid) literal UTF-8 will always appear literally in the source. But I guess it doesn't matter so much since the only place Unicode is valid is in identifiers and as whitespace, and neither of those will use the output of getCharAndSize. I do, however, want to delay the check for if a backslash starts a UCN to avoid Eli's evil recursion problem:</div>
<div><br></div><div>char *\\</div><div>\\\</div><div>\\</div><div>\\\</div><div>\\</div><div>\\\</div><div>\u00FC;</div><div><br></div><div>If UCNs are processed in getCharAndSize, you end up with several recursive calls asking if the first backslash starts a UCN.</div>
</div></blockquote></div><div class="gmail_quote"><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">
<div>It doesn't, of course, but if getCharAndSize calls isUCNAfterSlash you need to getCharAndSize all the way to the character after the final backslash to prove it.</div></div></blockquote><div><br></div><div>Once you get to the third backslash and see it's not followed by whitespace, you can stop, right? Is this just a performance concern for a pathological case, or is there more to it than that?</div></div></blockquote><div><br></div>If getCharAndSize is responsible for identifying UCNs, then you have to actually read the whole UCN to give a correct code for it. On the flip side, to properly read a UCN in C mode, you have to go through getCharAndSize instead of reading directly, because you want newline splicing to happen. So even asking if the next character is a UCN ends up requiring you to read <i>all</i> the backslashes, which is fairly stupid on Clang's part.</div><div><br></div><div>We could copy the newline-splicing logic into getCharAndSize, or we could just ignore this pathological case. Or we could go with my approach (handle UCNs in LexTokenInternal and later) which makes the problem go away because "backslash" doesn't imply "eagerly consume" unless spelling-followed by a newline.<br><div><br></div><br><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>After all, this <i>is</i> a UCN, in C at least:</div><div><br></div>
<div>char *\</div><div>\</div><div>u00FC;</div><div><br></div><div>And once we're delaying the backslash, I'm not sure it makes sense to classify the Unicode until we hit LexTokenInternal. Once we get there, though, I can see it making sense to do it there rather than in identifier creation, and have a (mostly) unified Unicode path after that.</div>
</div></blockquote></div><br><div>Yes, if we need to delay handling the backslash, that makes sense to me, but it's likely to be quite an invasive change.</div>
</blockquote><br></div><div>That's what my patch did. I didn't think it was that invasive.</div><br></body></html>