[cfe-dev] Wide strings and clang::StringLiteral.

Thu Dec 4 20:02:47 PST 2008

On Dec 4, 2008, at 6:48 PM, Eli Friedman wrote:

> On Thu, Dec 4, 2008 at 4:21 PM, Chris Lattner <clattner at apple.com> wrote:
>> Here are some starting steps:
>>
>> 1) Document StringLiteral as being canonicalized to UTF8.  We'll
>> require sema to translate the input string to utf8, and codegen and
>> other clients to convert it to the character set they want.
>
> Sema itself needs to do the translation to the execution charset, or
> at least have some knowledge of it; sizeof("こんにちは") depends on the
> execution charset.

Right, I think that should happen in Sema's handling of the string
literal (which sets its type, and the type includes the length).
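To make the sizeof point concrete, here is a minimal sketch of how the size of the same literal changes with the target encoding. The helper names (utf8Bytes, utf16Bytes, literalSizeUtf8, literalSizeUtf16) are hypothetical and are not part of clang; UTF-16 is used only as a second encoding for comparison, as it might apply to a wide literal on a platform with a 16-bit wchar_t.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed to encode one Unicode code point in UTF-8.
inline std::size_t utf8Bytes(char32_t cp) {
  assert(cp <= 0x10FFFF && "not a Unicode code point");
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// Bytes needed to encode one Unicode code point in UTF-16
// (one 16-bit unit, or a surrogate pair above the BMP).
inline std::size_t utf16Bytes(char32_t cp) {
  assert(cp <= 0x10FFFF && "not a Unicode code point");
  return cp < 0x10000 ? 2 : 4;
}

// Hypothetical helper: size in bytes of a literal encoded as UTF-8,
// including the one-byte NUL terminator.
inline std::size_t literalSizeUtf8(const std::vector<char32_t>& cps) {
  std::size_t n = 1;  // terminator
  for (char32_t cp : cps) n += utf8Bytes(cp);
  return n;
}

// Hypothetical helper: size in bytes of a literal encoded as UTF-16,
// including the two-byte NUL terminator.
inline std::size_t literalSizeUtf16(const std::vector<char32_t>& cps) {
  std::size_t n = 2;  // terminator
  for (char32_t cp : cps) n += utf16Bytes(cp);
  return n;
}
```

For the five code points of "こんにちは" (U+3053, U+3093, U+306B, U+3061, U+306F), each character takes three bytes in UTF-8 and two in UTF-16, so the two helpers return different sizes for the same source text. That is why the literal's type cannot be fixed without knowing the execution charset.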

>> 2) Add -finput-charset to clang.  Is iconv generally available (e.g.
>> on Windows)?  If not, we'll need some configury magic to detect it.
>
> It's not particularly difficult to get, but Windows users are unlikely
> to have it installed.

Ok, it would be nice to not add a dependency so people can get started  
quickly.  Are there any license issues?

>> 3) Teach sema about UTF8 input and iconv.  Sema should handle the
>> default cases (e.g. UTF8 and character sets where no "bad" things
>> occur) as quickly as possible, while falling back to iconv for hard
>> cases (or emitting an error if iconv isn't available).
>
> "Good" charsets, if I'm understanding correctly, are those character
> sets which are a superset of ASCII and where ASCII bytes are never
> part of a multi-byte sequence representing something else.  This
> includes charsets like UTF-8, ISO-8859-*, and EUC-JP.  "Bad" charsets
> include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
> multi-byte sequence rule).
>
> Assuming Sema never sees a string in a "bad" charset, conversion can
> be skipped if and only if either the input and execution character
> sets are the same, or the string contains only ASCII characters.

I'd be fine with specializing on the "charset is an ASCII superset &&
string contains only characters in the range 0-0x7f" case and having a
slow path for everything else.
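A minimal sketch of that fast-path test, assuming a hypothetical CharsetInfo record that says whether the input charset is an ASCII superset. Neither CharsetInfo nor canSkipConversion exists in clang; they are names made up for illustration.

```cpp
#include <string>

// True if every byte is in the range 0-0x7f.
inline bool isPureAscii(const std::string& bytes) {
  for (unsigned char c : bytes)
    if (c > 0x7F) return false;
  return true;
}

// Hypothetical description of an input charset.
struct CharsetInfo {
  std::string name;
  bool asciiSuperset;  // e.g. true for UTF-8, ISO-8859-*, EUC-JP
};

// Fast path: the literal's bytes can be used unchanged when the charset
// is an ASCII superset and the string is pure ASCII. Anything else
// would go to the slow path (iconv, or an error if it is unavailable).
inline bool canSkipConversion(const CharsetInfo& input,
                              const std::string& bytes) {
  return input.asciiSuperset && isPureAscii(bytes);
}
```

This is deliberately narrower than the "if and only if" condition quoted above: it does not try to detect the case where the input and execution charsets are identical, and simply sends every non-ASCII string to the slow path.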

>> 4) Enhance the lexer, if required, to handle lexing strings properly.
>
> The current lexer likely breaks in a lot of other cases for "bad"
> charsets... we probably just want to convert input in any of these
> charsets upfront, before lexing.  The lexer shouldn't need any changes
> for "good" charsets, if my understanding of the standard is correct.

That makes a lot of sense to me!
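As an illustration of why a "bad" charset trips up a byte-oriented lexer, the sketch below counts bytes equal to 0x5C (the ASCII backslash), which is roughly what a lexer scanning for escape sequences would see. The function name is hypothetical. The katakana letter ソ is encoded as 0x83 0x5C in Shift JIS, so its second byte looks like a backslash, whereas in UTF-8 (0xE3 0x82 0xBD) and EUC-JP (0xA5 0xBD) no byte falls in the ASCII range.

```cpp
#include <cstddef>
#include <string>

// Count bytes that a byte-wise lexer would take for a backslash.
// In an ASCII-superset charset with no ASCII trail bytes, every hit is
// a real backslash; in Shift JIS a hit may be the second byte of a
// two-byte character.
inline std::size_t countBackslashBytes(const std::string& bytes) {
  std::size_t n = 0;
  for (unsigned char c : bytes)
    if (c == 0x5C) ++n;
  return n;
}
```

Calling this on the Shift JIS bytes of ソ reports one backslash even though the source text contains none, while the UTF-8 and EUC-JP encodings report zero. Converting such input up front, before lexing, avoids the false match.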

-Chris