[cfe-dev] Question on character sets and encodings

AlisdairM(public) public at alisdairm.net
Sat Jun 6 02:12:25 PDT 2009


According to the C++ phases of translation (I don't know C or Objective-C),
source files should be transformed into the 'basic source character set'
before parsing, with any characters outside this set turned into a
universal-character-name representation.
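
To make that mapping concrete, here is a minimal sketch of what I understand
phase 1 to mean, assuming the physical source file is UTF-8. The name
toUniversalCharacterNames is hypothetical and not anything in Clang. As a
simplification it passes every ASCII byte through unchanged, although a few
ASCII characters (such as '$', '@' and '`') are not strictly in the basic
source character set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Clang API): rewrites UTF-8 input so that every
// non-ASCII code point becomes a universal-character-name (\uXXXX or
// \UXXXXXXXX). Assumption: ASCII bytes are copied through untouched.
// Overlong encodings and surrogate code points are not rejected here.
std::string toUniversalCharacterNames(const std::string &utf8) {
  std::string out;
  for (std::size_t i = 0; i < utf8.size();) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80) {            // single-byte (ASCII) character
      out += utf8[i++];
      continue;
    }

    // Work out how many continuation bytes follow the lead byte.
    std::size_t extra = 0;
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (i + extra >= utf8.size())
      throw std::runtime_error("truncated UTF-8 sequence");

    // Fold the low six bits of each continuation byte into the code point.
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
      if ((b & 0xC0) != 0x80)
        throw std::runtime_error("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;

    // Four hex digits suffice up to U+FFFF; otherwise use the eight-digit form.
    char buf[16];
    std::snprintf(buf, sizeof buf, cp <= 0xFFFF ? "\\u%04X" : "\\U%08X",
                  static_cast<unsigned>(cp));
    out += buf;
  }
  return out;
}
```

For example, feeding it the UTF-8 bytes for "café" should yield the text
caf\u00E9, which is what a strict phase 1 would hand to the rest of the
front end.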

Of course, this is an 'as-if' rule and we are free to implement something
that does such translation on the fly, or be really smart and work with a
different character set/encoding entirely that behaves as a superset (e.g.
ASCII or UTF-8).

So my question is: What does Clang actually do?

Within a parse function, can I assume any character I meet will be
exclusively from the basic character set? Can I assume ASCII encoding (e.g.
all control characters have value < 32)?

Conversely, what source encodings does Clang accept?
Can I feed it a file with UTF-8/UTF-16/UTF-32 encodings?

I'm going to need to understand this if I'm going to implement
char16_t/char32_t 'right', rather than quickly, and I'm beginning to think I
could have picked an easier starter project after all!

Finally, are there any existing Unicode facilities in the code base I can
call on when transcoding into and out of Unicode?
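
For reference, the kind of transcoding I have in mind for char16_t literals
is roughly the following: turning one Unicode code point into UTF-16 code
units. This is only a sketch assuming a C++0x compiler with char16_t and
char32_t; encodeUtf16 is a hypothetical name, not an existing function in the
code base.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Clang API): encodes a single Unicode code point
// as one or two UTF-16 code units.
std::u16string encodeUtf16(char32_t cp) {
  if (cp > 0x10FFFF)
    throw std::invalid_argument("code point beyond U+10FFFF");
  if (cp >= 0xD800 && cp <= 0xDFFF)
    throw std::invalid_argument("surrogate code points are not characters");

  std::u16string out;
  if (cp < 0x10000) {
    // Code points in the Basic Multilingual Plane fit in one code unit.
    out.push_back(static_cast<char16_t>(cp));
  } else {
    // Everything else is split into a high/low surrogate pair:
    // subtract 0x10000, then place 10 bits in each surrogate.
    const char32_t v = cp - 0x10000;
    out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
  }
  return out;
}
```

With this, U+00E9 stays a single code unit, while U+1F600 becomes the pair
0xD83D 0xDE00. A char32_t literal would need no such step, since each code
point is stored directly.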

Thanks
AlisdairM





