[cfe-dev] Question on character sets and encodings

Sat Jun 6 04:49:06 PDT 2009

> -----Original Message-----
> From: Eli Friedman [mailto:eli.friedman at gmail.com]
> Sent: 06 June 2009 12:08
> To: AlisdairM(public)
> Cc: cfe-dev at cs.uiuc.edu
> Subject: Re: [cfe-dev] Question on character sets and encodings
> 
> On Sat, Jun 6, 2009 at 2:12 AM, AlisdairM(public)<public at alisdairm.net>
> wrote:
>>> Of course, this is an 'as-if' rule and we are free to implement
>>> something that does such translation on the fly, or be really
>>> smart and work with a different character set/encoding entirely
>>> that behaves as a super-set (e.g. ASCII or UTF8).
>>>
>>> So my question is: What does Clang actually do?

> clang currently does nothing in this regard; in practice, this ends up
> being roughly equivalent to assuming both the source and execution
> charset are UTF-8.  If you want more discussion, try looking through
> the cfe-dev archives.

Thanks.

>>> Within a parse function, can I assume any character I meet will be
>>> exclusively from the basic character set? Can I assume ASCII encoding
>>> (e.g. all control characters have value < 32)?
> 
> Yes, feel free to assume an ASCII superset; the current plan (once
> someone gets around to tackling it) is to translate to UTF-8 any
> charset where that doesn't work.

OK, so I should assume a character like '@' will appear as a single character,
and will not be translated to the appropriate universal-character-name?

>>> Conversely, what source encodings does Clang accept?
>>> Can I feed it a file with UTF-8/UTF-16/UTF32 encodings?

> Currently just UTF-8.  Actually, it might be a decent first project to
> add -finput-charset support: it should just be a matter of making the
> source manager do charset translation on the file before starting
> lexing.

Makes sense, although it is a part of the compiler I had been hoping to
leave for others <g>

I'm not sure about using a compiler switch though, as surely we must cope
with #including a file with a different encoding than the rest of the
project?  For example, if we are encoding in UTF-16, it is unlikely that the
standard library was supplied with the same encoding, or Boost, or other
popular libraries.

I suggest it might be better to detect a Unicode BOM per file and transcode
accordingly.  In the absence of a BOM, assume UTF-8.
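
A rough sketch of the BOM check suggested above, to make the idea concrete.
The names (SourceEncoding, BOMResult, detectBOM) are hypothetical and not
part of Clang; the only facts relied on are the standard Unicode BOM byte
sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; nothing here is Clang API.
enum class SourceEncoding { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct BOMResult {
  SourceEncoding Encoding;
  std::size_t BOMLength; // number of leading bytes to skip before lexing
};

// Inspect the first bytes of a file buffer and guess its encoding.
// Falls back to UTF-8 (with nothing to skip) when no BOM is present.
inline BOMResult detectBOM(const std::string &Buf) {
  auto Byte = [&](std::size_t I) {
    return static_cast<unsigned char>(Buf[I]);
  };
  const std::size_t N = Buf.size();

  // UTF-32 must be tested before UTF-16, because the UTF-32LE BOM
  // (FF FE 00 00) starts with the UTF-16LE BOM (FF FE). A UTF-16LE file
  // whose first character is U+0000 would be misdetected; this sketch
  // accepts that ambiguity.
  if (N >= 4 && Byte(0) == 0xFF && Byte(1) == 0xFE &&
      Byte(2) == 0x00 && Byte(3) == 0x00)
    return {SourceEncoding::UTF32LE, 4};
  if (N >= 4 && Byte(0) == 0x00 && Byte(1) == 0x00 &&
      Byte(2) == 0xFE && Byte(3) == 0xFF)
    return {SourceEncoding::UTF32BE, 4};
  if (N >= 3 && Byte(0) == 0xEF && Byte(1) == 0xBB && Byte(2) == 0xBF)
    return {SourceEncoding::UTF8, 3};
  if (N >= 2 && Byte(0) == 0xFF && Byte(1) == 0xFE)
    return {SourceEncoding::UTF16LE, 2};
  if (N >= 2 && Byte(0) == 0xFE && Byte(1) == 0xFF)
    return {SourceEncoding::UTF16BE, 2};

  return {SourceEncoding::UTF8, 0};
}
```

Because the check runs per buffer, each #included file could be sniffed
independently, which is the case a single project-wide switch does not cover.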

>>> Finally, are there any existing Unicode facilities in the code base I
>>> can call on when trying to transcode into/out-of Unicode?
> 
> See include/Basic/ConvertUTF.h.

Excellent!  I see all the facilities I will want to use are currently
commented out though!

At least there is an implementation of sorts.  I assume they are disabled
for lack of tests/validation?

So it looks like the prerequisite to the prerequisite of my self-selected easy
first project is to implement the UTF-16/32-to-UTF-8 transcoders.  Does
anyone on this list have a good set of reference tests I can use?
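
For reference while looking for a proper test suite, here is a minimal
sketch of the UTF-16-to-UTF-8 direction.  It is not the ConvertUTF.h
interface: appendUTF8 and utf16ToUTF8 are hypothetical helpers, the input
is assumed to be already in native byte order with any BOM stripped, and
std::u16string assumes a C++11 library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode scalar value as UTF-8 bytes.
inline void appendUTF8(std::uint32_t CP, std::string &Out) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Hypothetical helper: transcode UTF-16 code units to UTF-8.
// Returns false on an unpaired surrogate; Out may then hold partial output.
inline bool utf16ToUTF8(const std::u16string &In, std::string &Out) {
  for (std::size_t I = 0; I < In.size(); ++I) {
    std::uint32_t U = In[I];
    if (U >= 0xD800 && U <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      if (I + 1 >= In.size())
        return false;
      std::uint32_t Lo = In[I + 1];
      if (Lo < 0xDC00 || Lo > 0xDFFF)
        return false;
      U = 0x10000 + ((U - 0xD800) << 10) + (Lo - 0xDC00);
      ++I; // consume the low surrogate as well
    } else if (U >= 0xDC00 && U <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    }
    appendUTF8(U, Out);
  }
  return true;
}
```

A few vectors that any such transcoder should satisfy: U+0041 gives 41;
U+20AC gives E2 82 AC; the surrogate pair D834 DD1E (U+1D11E) gives
F0 9D 84 9E; a lone D834 must be rejected.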

AlisdairM