[cfe-dev] Source code, character sets and encodings

Sun Jun 7 02:57:08 PDT 2009

[Note: subject line changed]

> -----Original Message-----
> From: cfe-dev-bounces at cs.uiuc.edu [mailto:cfe-dev-bounces at cs.uiuc.edu]
> On Behalf Of Sebastian Redl
> Sent: 07 June 2009 09:40
> To: Neil Booth
> Cc: cfe-dev at cs.uiuc.edu
> Subject: Re: [cfe-dev] Almost there...

> The source character set is irrelevant, at least going by the C++ standard.
> The very first phase of translation (C++ 2.2p1, bullet point 1) specifies
> an implementation-defined mapping of physical source file characters to the
> basic source character set. Making that mapping a UTF-32 to UTF-8 coding is
> perfectly valid.
> Interestingly enough, the standard says that any character not in the basic
> set must be encoded as a ucn. That sounds impractical, so I guess since we
> want to use UTF-8 internally anyway, we should make use of the as-if rule
> and instead represent everything, including ucns in the original source, in
> its real UTF-8 encoding.

I'm putting together an HTML document that will hopefully describe current
Clang assumptions and handling of source code and encodings, together with a
set of proposals to go forward with UCNs, Unicode string literals, raw
string literals, and source files in encodings other than UTF-8.  This will
be very biased towards the C++ standard requirements, although if you can
point me to a specification for Objective-C I will take that on board.  I
believe the C rules are very similar to C++, as the standards try to stay in
sync for these low-level details, although I will double-check for corner
cases.

For reference, in a former life I was PM for a C++ IDE and was very
surprised at the demand from Japan for UTF-16 support in source files.  The
assumption that UTF-8 would be adequate did not hold.  There seemed little
demand for support for UTF-32 encoding though.

This is clearly more work than I thought I was getting into for a first
project, but if it's worth doing then it is worth doing right - and there
are a number of features bound together here that I really want a plan to
deliver as a set, even if the implementation is incremental.

I'm also trying to pull together a few more papers for the next C++
committee mailing, due in two weeks, and I guess this will be ready shortly
after that.

The issue I need to investigate right now is whether and how we handle UCNs.
The impact is that a UCN will most probably take fewer characters in its
string literal representation than in the source itself, and we certainly
can't assume a 1-1 mapping of source locations to string literal
representations.  Diagnostics will probably want both representations, so
users get a chance to see if their UCN character matches the glyph they
expected, while still getting an accurate representation of the source.
Likewise, we must handle UCNs in identifiers, with similar issues for
reporting diagnostics.  My initial inclination for identifiers is that
displaying the UCN as the specified glyph is a job for IDEs and similar
tools, and from the command line we simply return the UCN as written in
source.
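
To make the source-location point concrete, here is a minimal sketch in
standalone C++.  It is not Clang code, and the names decodeUCNs,
DecodedLiteral, appendUTF8 and hexValue are hypothetical.  It assumes the
literal body is ASCII apart from \uXXXX and \UXXXXXXXX escapes, and it copies
anything that is not a well-formed UCN through unchanged.  Each output byte
records the source offset of the spelling that produced it, so a diagnostic
could show either form.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append code point cp to out as UTF-8.
// Assumes cp is a valid Unicode scalar value (checked by the caller).
void appendUTF8(std::uint32_t cp, std::string &out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Decoded literal plus a map back to the source spelling.
struct DecodedLiteral {
  std::string bytes;                   // UTF-8 contents of the literal
  std::vector<std::size_t> srcOffset;  // srcOffset[i]: where bytes[i] came from
};

// Returns the value of a hex digit, or -1 if c is not one.
int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Replace each well-formed UCN in src with its UTF-8 encoding.
DecodedLiteral decodeUCNs(const std::string &src) {
  DecodedLiteral result;
  std::size_t i = 0;
  while (i < src.size()) {
    const std::size_t start = i;
    std::size_t digits = 0;
    if (src[i] == '\\' && i + 1 < src.size()) {
      if (src[i + 1] == 'u') digits = 4;
      else if (src[i + 1] == 'U') digits = 8;
    }
    bool ok = digits != 0 && i + 2 + digits <= src.size();
    std::uint32_t cp = 0;
    for (std::size_t k = 0; ok && k < digits; ++k) {
      const int v = hexValue(src[i + 2 + k]);
      if (v < 0) ok = false;
      else cp = (cp << 4) | static_cast<std::uint32_t>(v);
    }
    // Reject values outside Unicode and the surrogate range.
    if (ok && (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;

    const std::size_t before = result.bytes.size();
    if (ok) {
      appendUTF8(cp, result.bytes);
      i += 2 + digits;
    } else {
      result.bytes += src[i];  // not a UCN: copy one character through
      ++i;
    }
    // Every byte just emitted maps back to the start of its source spelling.
    result.srcOffset.insert(result.srcOffset.end(),
                            result.bytes.size() - before, start);
  }
  return result;
}
```

For the source spelling caf\u00E9! (ten characters), this yields six bytes:
the two bytes of the encoded character both map back to source offset 3, and
the trailing '!' maps to offset 9, so positions after the UCN no longer line
up between the two representations.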

AlisdairM