On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <span dir="ltr">&lt;<a href="mailto:eli.friedman@gmail.com" target="_blank">eli.friedman@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith &lt;<a href="mailto:richard@metafoo.co.uk">richard@metafoo.co.uk</a>&gt; wrote:<br>

> I had a look at supporting UTF-8 in source files, and came up with the<br>

> attached approach. getCharAndSize maps UTF-8 characters down to a char with<br>

> the high bit set, representing the class of the character rather than the<br>

> character itself. (I've not done any performance measurements yet, and the<br>

> patch is generally far from being ready for review).<br>

><br>

> Have you considered using a similar approach for lexing UCNs? We already<br>

> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with them<br>

> there. Also, validating the codepoints early would allow us to recover<br>

> better (for instance, from UCNs encoding whitespace or elements of the basic<br>

> source character set).<br>

<br>

</div>That would affect the spelling of the tokens, and I don't think the C<br>

or C++ standard actually allows us to do that.</blockquote><div><br></div><div>If I understand you correctly, you're concerned that we would get the wrong string in the token's spelling? When we build a token, we take the characters from the underlying source buffer, not the value returned by getCharAndSize.</div>
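<div>As a rough illustration of that point, here is a minimal sketch (not Clang's actual lexer). The names classifyChar, kIdentClass, isIdentClass and lexIdentifierSpelling are hypothetical, and treating every non-ASCII character as an identifier character is a simplifying assumption. The scanner only ever sees a class byte for a multi-byte character, but the spelling is still copied from the source buffer:</div>

```cpp
#include <cstddef>
#include <string>

// Hypothetical class marker: every non-ASCII character is reported to the
// scanner as this single high-bit byte (an assumption made for this sketch).
constexpr unsigned char kIdentClass = 0x80;

// Hypothetical stand-in for getCharAndSize: returns the *class* of the
// character at `pos` and stores how many source bytes it occupies in `size`.
unsigned char classifyChar(const std::string &buf, std::size_t pos,
                           std::size_t &size) {
  unsigned char c = static_cast<unsigned char>(buf[pos]);
  if (c < 0x80) {
    size = 1;
    return c; // ASCII characters represent themselves.
  }
  // Derive the UTF-8 sequence length from the lead byte. No validation is
  // done here; a real lexer would reject malformed sequences.
  size = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : 2;
  if (pos + size > buf.size())
    size = buf.size() - pos; // Clamp a truncated sequence to the buffer end.
  return kIdentClass;
}

// True if the (possibly classified) character can appear in an identifier.
bool isIdentClass(unsigned char c) {
  return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == kIdentClass;
}

// Scans an identifier starting at `pos`. The scan is driven by the class
// bytes, but the returned spelling is a slice of the original buffer.
std::string lexIdentifierSpelling(const std::string &buf, std::size_t pos) {
  std::size_t start = pos;
  while (pos < buf.size()) {
    std::size_t size = 0;
    if (!isIdentClass(classifyChar(buf, pos, size)))
      break;
    pos += size;
  }
  return buf.substr(start, pos - start);
}
```

<div>In this sketch the class byte never reaches the token, so the original UTF-8 bytes survive in the spelling. Whether the same holds once UCNs are mapped in the same place is the open question in this thread.</div>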

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Evil testcase:<br>

<br>

#define CONCAT(a,b) a ## b<br>

#define \U000100010\u00FD 1<br>

#if !CONCAT(\, U000100010\u00FD)<br>

#error "This should never happen"<br>

#endif<br>

<span class="HOEnZb"><font color="#888888"><br>

-Eli<br>

</font></span></blockquote></div><br>