[cfe-commits] [PATCH] Support for universal character names in identifiers

Tue Nov 27 14:25:43 PST 2012

I had a look at supporting UTF-8 in source files, and came up with the
attached approach. getCharAndSize maps UTF-8 characters down to a char with
the high bit set, representing the class of the character rather than the
character itself. (I've not done any performance measurements yet, and the
patch is generally far from being ready for review).

Have you considered using a similar approach for lexing UCNs? We already
land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
them there. Also, validating the codepoints early would allow us to recover
better (for instance, from UCNs encoding whitespace or elements of the
basic source character set).

On Fri, Nov 16, 2012 at 9:33 PM, Richard Smith <richard at metafoo.co.uk>wrote:

> On Fri, Nov 16, 2012 at 6:53 PM, Eli Friedman <eli.friedman at gmail.com>
> wrote:
> > On Thu, Nov 15, 2012 at 8:30 PM, Richard Smith <richard at metafoo.co.uk>
> wrote:
> >> On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <eli.friedman at gmail.com>
> wrote:
> >>> Patch attached.  Adds support universal character names in
> identifiers, e.g.:
> >>>
> >>> char * \u00FC = "u-umlaut";
> >>>
> >>> Not that it's particularly useful, but it's a longstanding hole in our
> >>> C99 support.
> >>>
> >>> The general outline of the approach is that the spelling of the
> >>> identifier token contains the UCN, but the IdentifierInfo for the
> >>> identifier token contains pure UTF-8.  I think this is reasonable
> >>> given the C phases of translation, and consistent with the way we
> >>> handle UCNs in other contexts.
> >>
> >> This seems like a good approach to me.
> >>
> >>> I'm intentionally leaving out most of the support for universal
> >>> character names in user-defined literals, to try and reduce the size
> >>> of the patch.
> >>
> >> Index: include/clang/Lex/Lexer.h
> >> ===================================================================
> >> --- include/clang/Lex/Lexer.h   (revision 168014)
> >> +++ include/clang/Lex/Lexer.h   (working copy)
> >> @@ -573,6 +573,10 @@
> >>    void cutOffLexing() { BufferPtr = BufferEnd; }
> >>
> >>    bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);
> >> +
> >> +  bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned
> SizeTmp[5]);
> >> +  void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5],
> >> +                            Token &Result);
> >>
> >> These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in
> >> a struct so it doesn't have to be repeated in so many places, or at
> >> least passing it by reference so we'll get a compile error if the
> >> caller's array is the wrong size?
> >>
> >> Index: include/clang/Lex/Token.h
> >> ===================================================================
> >> --- include/clang/Lex/Token.h   (revision 168014)
> >> +++ include/clang/Lex/Token.h   (working copy)
> >> @@ -74,9 +74,10 @@
> >>      StartOfLine   = 0x01,  // At start of line or only after
> whitespace.
> >>      LeadingSpace  = 0x02,  // Whitespace exists before this token.
> >>      DisableExpand = 0x04,  // This identifier may never be macro
> expanded.
> >> -    NeedsCleaning = 0x08,   // Contained an escaped newline or
> trigraph.
> >> +    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.
> >>      LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
> >> -    HasUDSuffix = 0x20     // This string or character literal has a
> ud-suffix.
> >> +    HasUDSuffix = 0x20,    // This string or character literal has a
> ud-suffix.
> >> +    HasUCN = 0x40          // This identifier contains a UCN
> >>
> >> Missing full stop. ;-)
> >>
> >> The set of permitted characters appears to be correct only for C11 and
> >> C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller
> >> sets (and not even the same smaller set!). C++98 used the list from
> >> ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly,
> >> C++03 didn't move from the PDTR to the 1998 TR). If you're doing this
> >> to have a complete C99 (and C++98, modulo 'export') implementation,
> >> then maybe you care about this... :)
> >
> > I'll have to check whether I care about this.
> >
> >> +          if (UCNIdentifierBuffer.empty() ?
> !isAllowedInitiallyIDChar(UcnVal) :
> >> +                                            !isAllowedIDChar(UcnVal)) {
> >> +            StringRef CurCharacter = CleanedStr.substr(i, NumChars);
> >> +            Diag(Identifier, diag::err_ucn_invalid_in_id) <<
> CurCharacter;
> >>
> >> It'd be nice for the diagnostic to be different for UCNs which can't
> >> appear at all versus UCNs which can't appear at the start of an
> >> identifier.
> >>
> >>> I know this patch is a little lacking in terms of tests, but I'm not
> >>> really sure what tests we need; suggestions welcome.
> >>
> >> UCNs which resolve to characters in the basic source character set.
> >> Identifier emission in diagnostics.
> >> Stringization of tokens containing UCNs. (If I'm reading this right,
> >> we have a pre-existing bug here, in that characters outside the basic
> >> source character set must be converted into UCNs in the resulting
> >> string literal.)
> >
> > You mean like the following?
> >
> > #define X "\u00FC"
> > #define X "ü"
> >
> > This is valid in C++, but not C. :(
>
> That, and also this:
>
> #define STR(x) #x
> const char *p = STR("ü");
>
> ... which appears to be required to produce something like
> "\"\\u00FC\"" in C++, but to produce "\"ü\"" in C. This:
>
> const char *q = STR("\u00FC");
>
> ... produces "\"\\u00FC\"" in C++, and produces an
> implementation-defined choice of "\"\\u00FC\"" and "\"\u00FC\"" in C.
>
> >> ud-suffixes for integer and floating-point.
> >
> > Not working, but I'll add tests anyway.
> >
> >> Do you want to ExtWarn on this in C89?
> >
> > Err, actually, I think we need to disable this completely for C89;
> > IIRC, it's possible to write a valid C89 program which contains
> > something which looks like a UCN.
>
> Hmm, OK, although the \ couldn't be converted to a token, so it'd need
> to be removed during preprocessing or used as an operand to # or
> similar.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20121127/035ca40e/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf-8-in-source.diff
Type: application/octet-stream
Size: 10501 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20121127/035ca40e/attachment.obj>