[cfe-dev] Source code, character sets and encodings

Sun Jun 7 03:51:32 PDT 2009

On Sun, Jun 7, 2009 at 2:57 AM, AlisdairM(public)<public at alisdairm.net> wrote:
> I'm putting together an HTML document that will hopefully describe current
> Clang assumptions and handling of source code and encodings, together with a
> set of proposals to go forward with UCNs, Unicode string literals, raw
> string literals, and source files in encodings other than UTF-8.  This will
> be very biased towards the C++ standard requirements, although if you can
> point me to a specification for Objective-C I will take that on board.

There is no ObjC specification; for character sets and encodings,
though, ObjC doesn't have any special rules.

> Issues I need to investigate right now are how/if we handle UCNs.  The
> impact is that a UCN will most probably take fewer characters in its string
> literal representation than in the source itself, and we certainly can't
> assume a 1-1 mapping of source locations to string literal representations.
> Diagnostics probably will want both representations, so users get a chance
> to see if their UCN character matches the glyphs they expected, while still
> getting an accurate representation of the source.

We support UCNs in string/character literals, but not identifiers.
The AST representation translates the UCN, but we can ask the
preprocessor for locations in the original source; see
LiteralSupport.cpp for how we deal with this sort of thing.
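
To make the length mismatch concrete, here is a minimal sketch of
translating UCNs in a literal body while remembering where each output
byte came from. This is not the code in LiteralSupport.cpp; the names
translate, TranslatedLiteral, appendUTF8 and hexValue are hypothetical,
UTF-8 is assumed as the execution encoding, and other escapes and
surrogate checks are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not Clang's API): append code point `cp` as UTF-8.
void appendUTF8(std::uint32_t cp, std::string &out) {
  assert(cp <= 0x10FFFF && "not a Unicode code point");
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Value of a hex digit, or -1 if `c` is not one.
int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

struct TranslatedLiteral {
  std::string bytes;                      // translated contents (UTF-8 assumed)
  std::vector<std::size_t> sourceOffset;  // sourceOffset[i]: spelling offset that produced bytes[i]
};

// Translate the text between the quotes of a literal. Only \uXXXX and
// \UXXXXXXXX are handled; anything else is copied through unchanged.
TranslatedLiteral translate(const std::string &spelling) {
  TranslatedLiteral result;
  std::size_t i = 0;
  while (i < spelling.size()) {
    std::size_t digits = 0;
    if (spelling[i] == '\\' && i + 1 < spelling.size()) {
      if (spelling[i + 1] == 'u') digits = 4;
      else if (spelling[i + 1] == 'U') digits = 8;
    }
    bool ok = digits != 0 && i + 2 + digits <= spelling.size();
    std::uint32_t cp = 0;
    for (std::size_t k = 0; ok && k < digits; ++k) {
      int v = hexValue(spelling[i + 2 + k]);
      if (v < 0) ok = false;
      else cp = cp * 16 + static_cast<std::uint32_t>(v);
    }
    if (ok && cp <= 0x10FFFF) {
      // Every byte of the encoded character maps back to the backslash.
      std::size_t before = result.bytes.size();
      appendUTF8(cp, result.bytes);
      for (std::size_t b = before; b < result.bytes.size(); ++b)
        result.sourceOffset.push_back(i);
      i += 2 + digits;
    } else {
      result.sourceOffset.push_back(i);
      result.bytes += spelling[i];
      ++i;
    }
  }
  return result;
}
```

For the spelling a\u98a8b (8 source characters) this yields 5 bytes,
and the three bytes of the translated character all point back at
offset 1, so a diagnostic can show either form.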

> Likewise, we must handle
> UCNs in identifiers with similar issues of reporting diagnostics.  My
> initial inclination for identifiers is that displaying the UCN as the
> specified glyph is a job for IDEs and similar tools, and from the command
> line we will simply return the UCN as written in source.

Hmm... not sure.  If we want to allow extended glyphs in identifiers,
we probably have to canonicalize them in the AST; take the following
example:

int 風; // Directly written extended character
int \u98a8; // Same character written with UCN

If we want to accept this, they should both refer to the same object.
And if we don't accept the directly written form, there isn't much
point to accepting the UCN form except to say that we support the
standard...
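
A rough sketch of what canonicalizing could look like, assuming we pick
UTF-8 as the canonical spelling and key the lookup on it. Nothing here
is existing Clang code: canonicalIdentifier, toUTF8 and SymbolTable are
made-up names, and the sketch ignores which characters the standard
actually allows in identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
std::string toUTF8(std::uint32_t cp) {
  assert(cp <= 0x10FFFF && "not a Unicode code point");
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Rewrite each well-formed \uXXXX or \UXXXXXXXX in an identifier spelling
// as UTF-8; malformed sequences are copied through unchanged.
std::string canonicalIdentifier(const std::string &spelling) {
  const char *hexDigits = "0123456789abcdefABCDEF";
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    std::size_t digits = 0;
    if (spelling.compare(i, 2, "\\u") == 0) digits = 4;
    else if (spelling.compare(i, 2, "\\U") == 0) digits = 8;
    if (digits != 0) {
      std::string hex = spelling.substr(i + 2, digits);
      if (hex.size() == digits &&
          hex.find_first_not_of(hexDigits) == std::string::npos) {
        unsigned long cp = std::stoul(hex, nullptr, 16);
        if (cp <= 0x10FFFF) {
          out += toUTF8(static_cast<std::uint32_t>(cp));
          i += 2 + digits;
          continue;
        }
      }
    }
    out += spelling[i++];
  }
  return out;
}

// Toy symbol table keyed on the canonical spelling, so both ways of
// writing the same identifier resolve to the same entry.
struct SymbolTable {
  std::map<std::string, int> ids;
  int lookupOrCreate(const std::string &spelling) {
    std::string key = canonicalIdentifier(spelling);
    return ids.emplace(key, static_cast<int>(ids.size())).first->second;
  }
};
```

With this, looking up the directly written character (as UTF-8 bytes)
and looking up \u98a8 return the same id, which is the behavior the two
declarations above would need.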

-Eli