[cfe-commits] [PATCH] Support for universal character names in identifiers

Mon Jan 14 13:19:41 PST 2013

On Mon, Jan 14, 2013 at 11:53 AM, Jordan Rose <jordan_rose at apple.com> wrote:

> This got load-balanced to me, so I've been reworking Eli's patch to handle
> the recursive-getCharAndSize problem:
>
> > // Parsing this UCN requires line-splicing. This is valid C99.
>
> > #define newline_1_\u00F\
> > C 1
>
> The basic idea is the same ("the spelling of the token contains the UCN,
> but the IdentifierInfo contains pure UTF-8"), but rather than handle UCNs
> in getCharAndSize, this version accepts them only in LexIdentifier (and
> LexTokenInternal, to start an identifier). This doesn't strictly follow the
> abstract model of translation described in the standard(s), but it seems to
> work well in practice.
>

Sure, we don't need to worry too much about the abstract models of C and
C++ here, so long as we get the actual behavior correct.
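
To make the spelling-versus-IdentifierInfo split concrete, here is a minimal
sketch of turning a token spelling into the UTF-8 name. It is not code from
either patch: the names appendUTF8, hexValue, and spellingToUTF8 are
hypothetical, and it assumes the spelling is already available as a
std::string. It first undoes backslash-newline splices, then decodes \u and
\U escapes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one code point to `out` as UTF-8.
static void appendUTF8(std::uint32_t cp, std::string &out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: value of a hex digit, or -1 if `c` is not one.
static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Hypothetical sketch: convert an identifier spelling (which may contain
// UCNs and backslash-newline splices) into a pure UTF-8 name.
// Returns false on a malformed UCN. Only "\\\n" splices are handled here;
// "\\\r\n" and trailing whitespace after the backslash are ignored.
bool spellingToUTF8(const std::string &spelling, std::string &name) {
  // Step 1: remove line splices so a UCN split across lines is contiguous.
  std::string joined;
  for (std::size_t i = 0; i < spelling.size(); ++i) {
    if (spelling[i] == '\\' && i + 1 < spelling.size() &&
        spelling[i + 1] == '\n') {
      ++i; // skip the newline as well
      continue;
    }
    joined += spelling[i];
  }

  // Step 2: copy ordinary characters, decode \uXXXX and \UXXXXXXXX.
  name.clear();
  for (std::size_t i = 0; i < joined.size();) {
    if (joined[i] != '\\') {
      name += joined[i++];
      continue;
    }
    if (i + 1 >= joined.size()) return false;
    std::size_t digits;
    if (joined[i + 1] == 'u') digits = 4;
    else if (joined[i + 1] == 'U') digits = 8;
    else return false;
    if (i + 2 + digits > joined.size()) return false;

    std::uint32_t cp = 0;
    for (std::size_t d = 0; d < digits; ++d) {
      int v = hexValue(joined[i + 2 + d]);
      if (v < 0) return false;
      cp = (cp << 4) | static_cast<std::uint32_t>(v);
    }
    // Reject surrogates and values beyond the Unicode code space.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    appendUTF8(cp, name);
    i += 2 + digits;
  }
  return true;
}
```

Under this sketch, the spliced spelling in the quoted test and a plain
\u00FC produce the same name, and so do \u00FC and \U000000FC, which is the
property the planned redeclaration test depends on.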

> Eli originally wrote this:
> > I'm intentionally leaving out most of the support for universal
> > character names in user-defined literals, to try and reduce the size
> > of the patch.
> and I haven't put that support back yet.
>
> I also took out the test for token-pasting to form a UCN. As I read the
> standards, this is undefined in C99 and C11 (5.1.1.2p1.4), and C++03 and
> C++11 [lex.phases]p1.4, so I don't think we should worry about it.
>

We should have a test just to ensure that we don't crash or otherwise
misbehave, at least.

> There are a couple of FIXMEs for the differences between the allowed
> characters of C99, C++03, and C11/C++11. Also the diagnostics are pretty
> minimal:
> - no specific warning/fixit for an incorrect number of hex digits in a UCN
> - no specific warning for using a character in the basic character set (it
> just says "this is not a valid UCN")
>
> I only added a few test cases on top of Eli's, because I want to get
> feedback, but here's what I'm planning to test:
> - redeclaration using different escapes (\u and \U)
> - proper emission of LLVM bitcode
>
> And even though these can't be added to the test suite:
> - source -> LLVM bitcode -> executable
> - source -> object files -> executable
> - source -> LLVM bitcode -> object files -> executable
>
> (I'm not testing LLVM IR because I don't think we care right now if our IR
> parser can handle UTF-8, but our bitcode reader definitely should be able
> to.)
>
> Comments?
>
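
On the diagnostics point, a more specific warning mostly needs the validity
check to report why a UCN was rejected. A minimal sketch, assuming the C99
constraint in 6.4.3p2 (no value below 00A0 other than $, @, and `, and
nothing in D800 through DFFF). The enum and function names are hypothetical,
the upper bound of 10FFFF is my assumption based on the Unicode code space,
and C++ words its rule differently, so this is not a drop-in for both
languages.

```cpp
#include <cstdint>

// Hypothetical result type: one enumerator per distinct diagnostic.
enum class UCNStatus {
  Valid,          // passes the UCN constraint (says nothing about identifiers)
  BasicOrControl, // below U+00A0 and not one of $, @, `
  Surrogate,      // U+D800..U+DFFF
  OutOfRange      // above U+10FFFF (assumption: Unicode code space limit)
};

// Hypothetical sketch of the C99 6.4.3p2 constraint on a decoded UCN value.
UCNStatus classifyUCNForC99(std::uint32_t cp) {
  if (cp > 0x10FFFF)
    return UCNStatus::OutOfRange;
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return UCNStatus::Surrogate;
  if (cp < 0xA0 && cp != 0x24 && cp != 0x40 && cp != 0x60)
    return UCNStatus::BasicOrControl;
  return UCNStatus::Valid;
}

// Hypothetical mapping from status to diagnostic text.
const char *describeUCNStatus(UCNStatus s) {
  switch (s) {
  case UCNStatus::Valid:
    return "valid universal character name";
  case UCNStatus::BasicOrControl:
    return "universal character name refers to a basic or control character";
  case UCNStatus::Surrogate:
    return "universal character name refers to a surrogate";
  case UCNStatus::OutOfRange:
    return "universal character name is outside the Unicode code space";
  }
  return "unknown";
}
```

Valid here only means the value passes the UCN constraint; whether the
character is allowed in an identifier is a separate, per-standard check.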

As a general point, please keep in mind how we might support UTF-8 in
source code when working on this. The C++ standard requires that, as far as
observable behavior goes, an extended character in the source and the
corresponding UCN are treated identically (modulo raw string literals), so
the more code we can share between the two, the better.

Please see the attached patch for a start on implementing UTF-8 support.
One notable difference between this and the UCN patch is that the character
validation happens in the lexer, not when we come to look up an
IdentifierInfo; this is necessary in order to support error recovery for
UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids
for UCNs which we never convert to identifiers.
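
As a rough illustration of why lexer-side validation helps with UTF-8
whitespace, here is a minimal sketch, not taken from the attached patch. The
names decodeUTF8 and isUnicodeWhitespace are hypothetical, and the
whitespace list is the non-ASCII White_Space set as I recall it, so it should
be checked against the Unicode data files before being relied on.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch: decode one UTF-8 sequence starting at `pos`.
// On success stores the code point in `cp`, advances `pos`, returns true.
// Rejects truncated, overlong, surrogate, and out-of-range sequences,
// leaving `pos` and `cp` unchanged.
bool decodeUTF8(const std::string &s, std::size_t &pos, std::uint32_t &cp) {
  if (pos >= s.size()) return false;
  unsigned char lead = static_cast<unsigned char>(s[pos]);
  std::size_t len;
  std::uint32_t min; // smallest code point that needs `len` bytes
  std::uint32_t value;
  if (lead < 0x80) { len = 1; min = 0; value = lead; }
  else if ((lead & 0xE0) == 0xC0) { len = 2; min = 0x80; value = lead & 0x1F; }
  else if ((lead & 0xF0) == 0xE0) { len = 3; min = 0x800; value = lead & 0x0F; }
  else if ((lead & 0xF8) == 0xF0) { len = 4; min = 0x10000; value = lead & 0x07; }
  else return false; // stray continuation byte or invalid lead byte

  if (pos + len > s.size()) return false;
  for (std::size_t i = 1; i < len; ++i) {
    unsigned char c = static_cast<unsigned char>(s[pos + i]);
    if ((c & 0xC0) != 0x80) return false;
    value = (value << 6) | (c & 0x3F);
  }
  if (value < min || value > 0x10FFFF ||
      (value >= 0xD800 && value <= 0xDFFF))
    return false;

  cp = value;
  pos += len;
  return true;
}

// Hypothetical sketch: non-ASCII characters a lexer might treat as
// whitespace for error recovery (assumed list, see the note above).
bool isUnicodeWhitespace(std::uint32_t cp) {
  return cp == 0x0085 || cp == 0x00A0 || cp == 0x1680 ||
         (cp >= 0x2000 && cp <= 0x200A) || cp == 0x2028 || cp == 0x2029 ||
         cp == 0x202F || cp == 0x205F || cp == 0x3000;
}
```

Once the lexer has a code point, whether it came from a UCN or from raw
UTF-8, the same checks can run on it; a no-break space can then be
diagnosed and skipped as whitespace rather than reported as a bad
identifier character.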

-- Richard

> Jordan
>
> On Nov 15, 2012, at 19:17 , Eli Friedman <eli.friedman at gmail.com> wrote:
>
> > Patch attached.  Adds support for universal character names in
> > identifiers, e.g.:
> >
> > char * \u00FC = "u-umlaut";
> >
> > Not that it's particularly useful, but it's a longstanding hole in our
> > C99 support.
> >
> > The general outline of the approach is that the spelling of the
> > identifier token contains the UCN, but the IdentifierInfo for the
> > identifier token contains pure UTF-8.  I think this is reasonable
> > given the C phases of translation, and consistent with the way we
> > handle UCNs in other contexts.
> >
> > I'm intentionally leaving out most of the support for universal
> > character names in user-defined literals, to try and reduce the size
> > of the patch.
> >
> > I know this patch is a little lacking in terms of tests, but I'm not
> > really sure what tests we need; suggestions welcome.
> >
> > -Eli
> > <ucn-id.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf-8-source.diff
Type: application/octet-stream
Size: 10511 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20130114/7c1b0315/attachment.obj>