[cfe-commits] [PATCH] Support for universal character names in identifiers

Mon Jan 14 11:53:43 PST 2013

This got load-balanced to me, so I've been reworking Eli's patch to handle the recursive-getCharAndSize problem:

> // Parsing this UCN requires line-splicing. This is valid C99.
> #define newline_1_\u00F\
> C 1

The basic idea is the same ("the spelling of the token contains the UCN, but the IdentifierInfo contains pure UTF-8"), but rather than handle UCNs in getCharAndSize, this version accepts them only in LexIdentifier (and LexTokenInternal, to start an identifier). This doesn't strictly follow the abstract model of translation described in the standard(s), but it seems to work well in practice.
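
As a rough illustration of that split between spelling and IdentifierInfo (this is not code from the patch; `removeSplices`, `appendUTF8`, `hexValue`, and `identifierNameFromSpelling` are hypothetical names), the sketch below takes a token spelling that may contain UCNs and backslash-newline splices and produces the pure UTF-8 name. It assumes the spelling has already been lexed as a single identifier token and ignores which characters each standard actually allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: drop each backslash that is immediately followed by a
// newline, together with that newline (the phase 2 line splice).
std::string removeSplices(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '\\' && i + 1 < s.size() && s[i + 1] == '\n') {
      ++i; // skip the newline as well
      continue;
    }
    out.push_back(s[i]);
  }
  return out;
}

// Append the UTF-8 encoding of a code point to out.
void appendUTF8(std::string &out, std::uint32_t cp) {
  assert(cp <= 0x10FFFF && "code point out of Unicode range");
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Value of a hex digit, or -1 if c is not one.
int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical: map an identifier spelling to its UTF-8 name.
// Returns std::nullopt if a backslash does not start a well-formed UCN.
std::optional<std::string> identifierNameFromSpelling(std::string_view spelling) {
  const std::string s = removeSplices(spelling);
  std::string name;
  for (std::size_t i = 0; i < s.size();) {
    if (s[i] != '\\') {
      name.push_back(s[i++]);
      continue;
    }
    if (i + 1 >= s.size() || (s[i + 1] != 'u' && s[i + 1] != 'U'))
      return std::nullopt;
    const std::size_t digits = (s[i + 1] == 'u') ? 4 : 8;
    if (i + 2 + digits > s.size())
      return std::nullopt;
    std::uint32_t cp = 0;
    for (std::size_t d = 0; d < digits; ++d) {
      const int v = hexValue(s[i + 2 + d]);
      if (v < 0)
        return std::nullopt;
      cp = (cp << 4) | static_cast<std::uint32_t>(v);
    }
    // Reject surrogates and values past the end of the Unicode range.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return std::nullopt;
    appendUTF8(name, cp);
    i += 2 + digits;
  }
  return name;
}
```

With this sketch, the spelling from the spliced `#define` above and a plain `\u00FC` spelling both map to names ending in the same two UTF-8 bytes (0xC3 0xBC), while the original spelling stays available unchanged.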

Eli originally wrote this:
> I'm intentionally leaving out most of the support for universal
> character names in user-defined literals, to try and reduce the size
> of the patch.
and I haven't put that support back yet.

I also took out the test for token-pasting to form a UCN. As I read the standards, this is undefined in C99 and C11 (5.1.1.2p1.4) and in C++03 and C++11 ([lex.phases]p1.4), so I don't think we should worry about it.

There are a couple of FIXMEs for the differences between the characters allowed by C99, C++03, and C11/C++11. Also, the diagnostics are pretty minimal:
- no specific warning/fixit for an incorrect number of hex digits in a UCN
- no specific warning for using a character in the basic character set (it just says "this is not a valid UCN")
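
To make the two missing diagnostics concrete, here is a minimal sketch of how a UCN could be classified so that each problem gets its own message. It is not taken from the patch: `UCNProblem`, `ucnHexDigit`, and `classifyUCN` are hypothetical names. The range checks follow the C99 constraint in 6.4.3 (nothing below 00A0 other than 0024, 0040, and 0060, and nothing in D800 through DFFF); the C++ rules are worded differently, and the upper bound of 10FFFF is an assumption based on the Unicode range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical result type: one enumerator per diagnostic we might emit.
enum class UCNProblem {
  None,            // well-formed and in an allowed range
  NotAUCN,         // does not start with backslash-u or backslash-U
  WrongDigitCount, // fewer than 4 (or 8) hex digits follow
  BelowA0,         // below U+00A0 (covers the basic character set)
  Surrogate,       // in D800 through DFFF
  OutOfRange       // above 10FFFF (assumed upper bound)
};

// Value of a hex digit, or -1 if c is not one.
int ucnHexDigit(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Classify the UCN that starts at the beginning of text.
UCNProblem classifyUCN(std::string_view text) {
  if (text.size() < 2 || text[0] != '\\' || (text[1] != 'u' && text[1] != 'U'))
    return UCNProblem::NotAUCN;

  const std::size_t expected = (text[1] == 'u') ? 4 : 8;
  std::uint32_t cp = 0;
  std::size_t seen = 0;
  while (seen < expected && 2 + seen < text.size()) {
    const int v = ucnHexDigit(text[2 + seen]);
    if (v < 0)
      break;
    cp = (cp << 4) | static_cast<std::uint32_t>(v);
    ++seen;
  }

  if (seen != expected)
    return UCNProblem::WrongDigitCount;
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return UCNProblem::Surrogate;
  if (cp > 0x10FFFF)
    return UCNProblem::OutOfRange;
  // C99 6.4.3: only $, @ and ` may be named below U+00A0.
  if (cp < 0xA0 && cp != 0x24 && cp != 0x40 && cp != 0x60)
    return UCNProblem::BelowA0;
  return UCNProblem::None;
}
```

A lexer could use the `WrongDigitCount` result to attach a fixit and the `BelowA0` result to say that the character must be written directly, rather than reporting both as "not a valid UCN".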

I only added a few test cases on top of Eli's because I want to get feedback first, but here's what I'm planning to test:
- redeclaration using different escapes (\u and \U)
- proper emission of LLVM bitcode
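
For the first item, a test might look roughly like the following. This is only a sketch of the idea, not one of the tests in the patch, and it assumes a compiler that already accepts UCNs in identifiers. `checkSameEntity` and `readViaLongForm` are made-up names. The four-digit and eight-digit spellings should name the same variable, so the redeclaration is accepted and both spellings refer to one object.

```cpp
#include <cassert>

extern int \u00FC;      // first declaration, four-digit form
extern int \U000000FC;  // redeclaration of the same variable, eight-digit form
int \u00FC = 42;        // the single definition

// Reads the variable through the eight-digit spelling.
int readViaLongForm() { return \U000000FC; }

// Both spellings should resolve to the same object.
void checkSameEntity() {
  assert(&\u00FC == &\U000000FC);
  assert(\U000000FC == 42);
}
```

If the two spellings were treated as different identifiers, the eight-digit name would have no definition and the program would fail to link.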

And even though these can't be added to the test suite:
- source -> LLVM bitcode -> executable
- source -> object files -> executable
- source -> LLVM bitcode -> object files -> executable

(I'm not testing LLVM IR because I don't think we care right now if our IR parser can handle UTF-8, but our bitcode reader definitely should be able to.)

Comments?
Jordan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: UCNs.patch
Type: application/octet-stream
Size: 16760 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20130114/10388799/attachment.obj>
-------------- next part --------------

On Nov 15, 2012, at 19:17 , Eli Friedman <eli.friedman at gmail.com> wrote:

> Patch attached.  Adds support for universal character names in identifiers, e.g.:
> 
> char * \u00FC = "u-umlaut";
> 
> Not that it's particularly useful, but it's a longstanding hole in our
> C99 support.
> 
> The general outline of the approach is that the spelling of the
> identifier token contains the UCN, but the IdentifierInfo for the
> identifier token contains pure UTF-8.  I think this is reasonable
> given the C phases of translation, and consistent with the way we
> handle UCNs in other contexts.
> 
> I'm intentionally leaving out most of the support for universal
> character names in user-defined literals, to try and reduce the size
> of the patch.
> 
> I know this patch is a little lacking in terms of tests, but I'm not
> really sure what tests we need; suggestions welcome.
> 
> -Eli
> <ucn-id.txt>