[cfe-dev] Wide strings and clang::StringLiteral.

Eli Friedman eli.friedman at gmail.com
Thu Dec 4 18:48:36 PST 2008


On Thu, Dec 4, 2008 at 4:21 PM, Chris Lattner <clattner at apple.com> wrote:
> Here are some starting steps:
>
> 1) Document StringLiteral as being canonicalized to UTF8.  We'll
> require sema to translate the input string to utf8, and codegen and
> other clients to convert it to the character set they want.

Sema itself needs to do the translation to the execution charset, or
at least have some knowledge of it; sizeof("こんにちは") depends on the
execution charset.
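
To make the sizeof point concrete, here is a minimal sketch that spells
out the bytes of "こんにちは" by hand in two encodings. The names
kHelloUtf8 and kHelloSjis are made up for this illustration, and the
explicit hex escapes are used so the result does not depend on how any
particular compiler picks its execution charset.

```cpp
#include <cstddef>

// "こんにちは" as explicit bytes (hypothetical names, illustration only).
// UTF-8 uses three bytes for each of the five characters.
constexpr char kHelloUtf8[] =
    "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF";
// Shift JIS uses two bytes for each of the five characters.
constexpr char kHelloSjis[] = "\x82\xB1\x82\xF1\x82\xC9\x82\xBF\x82\xCD";

// sizeof includes the terminating NUL, so the same source-level string
// has a different size depending on the execution charset.
static_assert(sizeof(kHelloUtf8) == 5 * 3 + 1, "UTF-8: 15 bytes + NUL");
static_assert(sizeof(kHelloSjis) == 5 * 2 + 1, "Shift JIS: 10 bytes + NUL");
```

Since sizeof is a constant expression, whatever evaluates it has to know
the converted length at that point, not just at codegen time.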

> 2) Add -finput-charset to clang.  Is iconv generally available (e.g.
> on windows?) if not, we'll need some configury magic to detect it.

It's not particularly difficult to get, but Windows users are unlikely
to have it installed.

> 3) Teach sema about UTF8 input and iconv.  Sema should handle the
> default cases (e.g. UTF8 and character sets where no "bad" things
> occur) as quickly as possible, while falling back to iconv for hard
> cases (or emitting an error if iconv isn't available).

"Good" charsets, if I'm understanding correctly, are those character
sets which are a superset of ASCII and where ASCII bytes are never
part of a multi-byte sequence representing something else.  This
includes charsets like UTF-8, ISO-8859-*, and EUC-JP.  "Bad" charsets
include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
multi-byte sequence rule).
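
A small sketch of why Shift JIS falls on the "bad" side: the character
U+8868 (表) is encoded as the two bytes 0x95 0x5C, and 0x5C is the ASCII
backslash. The helper containsByte and the array names below are
hypothetical, written only for this example.

```cpp
#include <cstddef>

// Hypothetical helper: true if any of the first n bytes equals target.
constexpr bool containsByte(const unsigned char *bytes, std::size_t n,
                            unsigned char target) {
  for (std::size_t i = 0; i < n; ++i)
    if (bytes[i] == target)
      return true;
  return false;
}

// U+8868 in Shift JIS: the trail byte 0x5C is also ASCII '\'.
constexpr unsigned char kHyouSjis[] = {0x95, 0x5C};
// U+8868 in UTF-8: every byte of a multi-byte sequence is >= 0x80.
constexpr unsigned char kHyouUtf8[] = {0xE8, 0xA1, 0xA8};

// A byte-oriented scan for a backslash gets a false hit in Shift JIS...
static_assert(containsByte(kHyouSjis, sizeof(kHyouSjis), '\\'),
              "Shift JIS trail byte collides with ASCII backslash");
// ...but not in UTF-8.
static_assert(!containsByte(kHyouUtf8, sizeof(kHyouUtf8), '\\'),
              "UTF-8 multi-byte sequences never contain ASCII bytes");
```

A lexer that scans bytes for an escape or a closing quote would treat
that trail byte as the start of an escape sequence.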

Assuming Sema never sees a string in a "bad" charset, conversion can
be skipped if and only if either the input and execution character
sets are the same, or the string contains only ASCII characters.
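
The skip condition above could be sketched roughly as follows. This is
not existing clang code: the Charset enum and the functions isAsciiOnly
and canSkipConversion are names invented for this example, and it
assumes both charsets are "good" in the sense described above.

```cpp
#include <cstddef>

// Hypothetical set of "good" charsets (all ASCII supersets).
enum class Charset { UTF8, ISO8859_1, EUC_JP };

// True if none of the first n bytes has the high bit set.
constexpr bool isAsciiOnly(const char *s, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (static_cast<unsigned char>(s[i]) >= 0x80)
      return false;
  return true;
}

// Conversion can be skipped when the charsets match, or when the string
// is pure ASCII and therefore has the same bytes in both charsets.
constexpr bool canSkipConversion(Charset input, Charset exec,
                                 const char *s, std::size_t n) {
  return input == exec || isAsciiOnly(s, n);
}
```

With this, an ASCII-only literal never reaches iconv at all, and a
non-ASCII literal only does when the two charsets actually differ.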

> 4) Enhance the lexer, if required, to handle lexing strings properly.

The current lexer likely breaks in a lot of other cases for "bad"
charsets, so we probably just want to convert input in any of these
charsets up front, before lexing.  The lexer shouldn't need any changes
for "good" charsets, if my understanding of the standard is correct.

-Eli


