<div class="gmail_quote">On Mon, Jan 14, 2013 at 11:53 AM, Jordan Rose <span dir="ltr">&lt;<a href="mailto:jordan_rose@apple.com" target="_blank">jordan_rose@apple.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

This got load-balanced to me, so I've been reworking Eli's patch to handle the recursive-getCharAndSize problem:<br>

<br>

> // Parsing this UCN requires line-splicing. This is valid C99.<br>

<br>

> #define newline_1_\u00F\<br>

> C 1<br>

<br>

The basic idea is the same ("the spelling of the token contains the UCN, but the IdentifierInfo contains pure UTF-8"), but rather than handle UCNs in getCharAndSize, this version accepts them only in LexIdentifier (and LexTokenInternal, to start an identifier). This doesn't strictly follow the abstract model of translation described in the standard(s), but it seems to work well in practice.<br>

</blockquote><div><br></div><div>Sure, we don't need to worry too much about the abstract models of C and C++ here, so long as we get the actual behavior correct.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Eli originally wrote this:<br>

<div class="im">> I'm intentionally leaving out most of the support for universal<br>

> character names in user-defined literals, to try and reduce the size<br>

> of the patch.<br>

</div>and I haven't put that support back yet.<br>

<br>

I also took out the test for token-pasting to form a UCN. As I read the standards, this is undefined in C99 and C11 (5.1.1.2p1.4), and C++03 and C++11 [lex.phases]p1.4, so I don't think we should worry about it.<br></blockquote>

<div><br></div><div>We should have a test just to ensure that we don't crash or otherwise misbehave, at least.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


There are a couple FIXMEs for the differences between the allowed characters of C99, C++03, and C11/C++11. Also the diagnostics are pretty minimal:<br>

- no specific warning/fixit for an incorrect number of hex digits in a UCN<br>

- no specific warning for using a character in the basic character set (it just says "this is not a valid UCN")<br>

<br>

I only added a few test cases on top of Eli's, because I want to get feedback, but here's what I'm planning to test:<br>

- redeclaration using different escapes (\u and \U)<br>

- proper emission of LLVM bitcode<br>

<br>

And even though these can't be added to the test suite:<br>

- source -> LLVM bitcode -> executable<br>

- source -> object files -> executable<br>

- source -> LLVM bitcode -> object files -> executable<br>

<br>

(I'm not testing LLVM IR because I don't think we care right now if our IR parser can handle UTF-8, but our bitcode reader definitely should be able to.)<br>

<br>

Comments?<br></blockquote><div><br></div><div>As a general point, please keep in mind how we might support UTF-8 in source code when working on this. The C++ standard requires our observable behavior to be the same whether a character appears in the source as an extended character or as a UCN (modulo raw string literals), so the more code we can share between the two, the better.</div>
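<div><br></div><div>To make the "treat them identically" point concrete, here is a minimal sketch (not code from either patch) of mapping an identifier's spelling to the pure UTF-8 name that would be used for lookup. The names hex_value, encode_utf8 and identifier_name are hypothetical, and the sketch assumes line splices have already been removed from the spelling. A UCN and the equivalent raw UTF-8 bytes should come out as the same name.</div>
<pre>
```c
/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Write the UTF-8 encoding of cp (assumed to be at most 0x10FFFF) to out.
   Returns the number of bytes written (1 to 4). */
static int encode_utf8(unsigned long cp, char *out) {
  if (cp < 0x80) {
    out[0] = (char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (char)(0xC0 | (cp >> 6));
    out[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (char)(0xF0 | (cp >> 18));
  out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (char)(0x80 | (cp & 0x3F));
  return 4;
}

/* Hypothetical helper: copy spelling into name, replacing each \uXXXX or
   \UXXXXXXXX with its UTF-8 encoding. Raw bytes (including raw UTF-8) are
   copied unchanged. name needs room for strlen(spelling) + 1 bytes, since
   a UCN's encoding is never longer than its spelling.
   Returns 1 on success, 0 on a malformed UCN. */
int identifier_name(const char *spelling, char *name) {
  while (*spelling) {
    int digits, i;
    unsigned long cp = 0;
    if (*spelling != '\\') {
      *name++ = *spelling++;
      continue;
    }
    if (spelling[1] == 'u') digits = 4;
    else if (spelling[1] == 'U') digits = 8;
    else return 0;
    for (i = 0; i < digits; ++i) {
      int v = hex_value(spelling[2 + i]);
      if (v < 0) return 0; /* also stops at the terminating NUL */
      cp = (cp << 4) | (unsigned long)v;
    }
    if (cp > 0x10FFFFUL) return 0; /* beyond the Unicode code space */
    name += encode_utf8(cp, name);
    spelling += 2 + digits;
  }
  *name = '\0';
  return 1;
}
```
</pre>
<div>With this, the spellings \u00FC, \U000000FC and a raw UTF-8 u-umlaut all yield the same two-byte name, while a truncated UCN such as \u00F is rejected. Validation of which code points are allowed is deliberately left out here.</div>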

<div><br></div><div>Please see the attached patch for a start on implementing UTF-8 support. One notable difference between this and the UCN patch is that the character validation happens in the lexer, not when we come to look up an IdentifierInfo; this is necessary in order to support error recovery for UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids for UCNs which we never convert to identifiers.</div>
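<div><br></div><div>As a rough illustration of validating in the lexer (a sketch under stated assumptions, not the attached patch), a single classifier could serve both the UCN and the UTF-8 paths once each has produced a code point. The name classify_extended_char and the enum are hypothetical, and the ranges below are a deliberately small subset chosen for illustration, not the full tables from the standards.</div>
<pre>
```c
enum char_class { CC_IDENTIFIER, CC_WHITESPACE, CC_INVALID };

/* Hypothetical classifier for a code point >= 0x80 that was decoded from
   either a UCN or a raw UTF-8 sequence. */
enum char_class classify_extended_char(unsigned long cp) {
  /* Illustrative subset of the C11 Annex D identifier ranges
     (Latin-1 letters); a real table has many more entries. */
  if ((cp >= 0x00C0 && cp <= 0x00D6) ||
      (cp >= 0x00D8 && cp <= 0x00F6) ||
      (cp >= 0x00F8 && cp <= 0x00FF))
    return CC_IDENTIFIER;

  /* A few Unicode space characters. They are not valid in identifiers,
     but recognizing them lets the lexer recover by treating them as
     whitespace instead of emitting a generic error. */
  if (cp == 0x00A0 || (cp >= 0x2000 && cp <= 0x200A) || cp == 0x3000)
    return CC_WHITESPACE;

  /* Everything else is rejected in this sketch. */
  return CC_INVALID;
}
```
</pre>
<div>Because the check runs while lexing, a UCN that never becomes an identifier still gets validated, which is the accepts-invalid concern mentioned above.</div>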

<div><br></div><div>-- Richard</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class="HOEnZb"><font color="#888888">Jordan<br>

<br>

</font></span><br><br>

<br>

<br>

On Nov 15, 2012, at 19:17, Eli Friedman &lt;<a href="mailto:eli.friedman@gmail.com">eli.friedman@gmail.com</a>&gt; wrote:<br>

<br>

> Patch attached.  Adds support for universal character names in identifiers, e.g.:<br>

><br>

> char * \u00FC = "u-umlaut";<br>

><br>

> Not that it's particularly useful, but it's a longstanding hole in our<br>

> C99 support.<br>

><br>

> The general outline of the approach is that the spelling of the<br>

> identifier token contains the UCN, but the IdentifierInfo for the<br>

> identifier token contains pure UTF-8.  I think this is reasonable<br>

> given the C phases of translation, and consistent with the way we<br>

> handle UCNs in other contexts.<br>

><br>

> I'm intentionally leaving out most of the support for universal<br>

> character names in user-defined literals, to try and reduce the size<br>

> of the patch.<br>

><br>

> I know this patch is a little lacking in terms of tests, but I'm not<br>

> really sure what tests we need; suggestions welcome.<br>

><br>

> -Eli<br>

> &lt;ucn-id.txt&gt;<br>


<br>

<br></blockquote></div><br>