[cfe-dev] Question on character sets and encodings

Eli Friedman eli.friedman at gmail.com
Sat Jun 6 04:08:16 PDT 2009


On Sat, Jun 6, 2009 at 2:12 AM, AlisdairM(public)<public at alisdairm.net> wrote:
> Of course, this is an 'as-if' rule and we are free to implement something
> that does such translation on the fly, or be really smart and work with a
> different character set/encoding entirely that behaves as a super-set (e.g.
> ASCII or UTF8).
>
> So my question is: What does Clang actually do?

Clang currently does no charset translation at all; in practice, this
ends up being roughly equivalent to assuming that both the source and
execution charsets are UTF-8.  If you want more discussion, try
searching the cfe-dev archives.

> Within a parse function, can I assume any character I meet will be
> exclusively from the basic character set? Can I assume ASCII encoding (e.g.
> all control characters have value < 32)?

Yes, feel free to assume an ASCII superset; the current plan (once
someone gets around to tackling it) is to translate any charset where
that assumption doesn't hold into UTF-8.
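
To illustrate why the ASCII-superset assumption is safe for UTF-8 input,
here is a minimal sketch. The helper names isAsciiControl and
countAsciiControls are hypothetical and not part of Clang. It relies only
on the fact that every byte of a multi-byte UTF-8 sequence has its high
bit set, so a byte-wise comparison against an ASCII value can never match
in the middle of a character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Clang API): true if the byte is an ASCII
// control character, i.e. has a value below 32.
inline bool isAsciiControl(unsigned char c) {
  return c < 0x20;
}

// Counts ASCII control characters in a buffer assumed to be UTF-8 (or
// plain ASCII). Lead and continuation bytes of multi-byte sequences are
// all >= 0x80, so they are never mistaken for control characters.
inline std::size_t countAsciiControls(const std::string &buf) {
  std::size_t n = 0;
  for (unsigned char c : buf) {
    if (isAsciiControl(c))
      ++n;
  }
  return n;
}
```

A parse function written this way does not need to decode the input; it
can compare bytes directly against basic-character-set values.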

> Conversely, what source encodings does Clang accept?
> Can I feed it a file with UTF-8/UTF-16/UTF32 encodings?

Currently just UTF-8.  Actually, it might be a decent first project to
add -finput-charset support: it should just be a matter of making the
source manager translate the file's charset to UTF-8 before lexing
starts.
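
As a rough sketch of what such a pre-lexing translation step could look
like for UTF-16 input: the functions appendUtf8 and transcodeUtf16ToUtf8
below are hypothetical, self-contained illustrations, not Clang's source
manager code and not the ConvertUTF.h interface. The sketch assumes the
file has already been read into native-endian UTF-16 code units.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of code point cp to out.
// Assumes cp <= 0x10FFFF and cp is not a surrogate.
inline void appendUtf8(std::string &out, std::uint32_t cp) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Hypothetical pre-lexing step: converts UTF-16 code units to UTF-8.
// Returns false if the input contains an unpaired surrogate.
inline bool transcodeUtf16ToUtf8(const std::u16string &in, std::string &out) {
  out.clear();
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t u = in[i];
    if (u >= 0xD800 && u <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      if (i + 1 >= in.size())
        return false;
      std::uint32_t lo = in[i + 1];
      if (lo < 0xDC00 || lo > 0xDFFF)
        return false;
      u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
      ++i;
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
    appendUtf8(out, u);
  }
  return true;
}
```

With a step like this in front of it, the lexer would only ever see
UTF-8, so the rest of the front end could keep assuming an ASCII
superset.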

> Finally, are there any existing Unicode facilities in the code base I can
> call on when trying to transcode into/out-of Unicode?

See include/Basic/ConvertUTF.h.

-Eli


