[cfe-dev] Wide strings and clang::StringLiteral.

Eli Friedman eli.friedman at gmail.com
Thu Dec 4 20:48:21 PST 2008


On Thu, Dec 4, 2008 at 8:02 PM, Chris Lattner <clattner at apple.com> wrote:
>>> 2) Add -finput-charset to clang.  Is iconv generally available (e.g.
>>> on windows?) if not, we'll need some configury magic to detect it.
>>
>> It's not particularly difficult to get, but Windows users are unlikely
>> to have it installed.
>
> Ok, it would be nice to not add a dependency so people can get started
> quickly.  Are there any license issues?

libiconv is LGPL; whether that counts as an "issue" license probably
depends on the company.  There is also apparently a lightweight
alternative called win_iconv, which I stumbled upon while looking
around, but I don't know much about it.

>>> 3) Teach sema about UTF8 input and iconv.  Sema should handle the
>>> default cases (e.g. UTF8 and character sets where no "bad" things
>>> occur) as quickly as possible, while falling back to iconv for hard
>>> cases (or emitting an error if iconv isn't available).
>>
>> "Good" charsets, if I'm understanding correctly, are those character
>> sets which are a superset of ASCII and where ASCII bytes are never
>> part of a multi-byte sequence representing something else.  This
>> includes charsets like UTF-8, ISO-8859-*, and EUC-JP.  "Bad" charsets
>> include UTF-16 (not an ASCII superset) and Shift JIS (breaks the
>> multi-byte sequence rule).
>>
>> Assuming Sema never sees a string in a "bad" charset, conversion can
>> be skipped if and only if either the input and execution character
>> sets are the same, or the string contains only ASCII characters.
>
> I'd be fine with specializing on the "ascii subset && string contains only
> characters in the range 0-0x7f" and having a slow path for everything else.

Ah, right, you want to store the strings in UTF-8.  That seems fine; I
expect non-ASCII in strings is very rare.
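
A minimal sketch of the fast-path test discussed above, assuming the
string arrives as raw bytes in a std::string.  The names isPureASCII
and choosePath and the InputIsASCIISuperset flag are hypothetical, not
anything in clang, and the slow path itself (iconv or otherwise) is
left out:

```cpp
#include <string>

// Hypothetical helper: true if every byte is in the range 0x00-0x7F.
bool isPureASCII(const std::string &Bytes) {
  for (unsigned char C : Bytes)
    if (C > 0x7F)
      return false;
  return true;
}

enum class ConversionPath { Fast, Slow };

// Hypothetical dispatch: take the fast path only when the input charset
// is an ASCII superset and the string has no bytes above 0x7F.
// Everything else goes to the slow (converting) path.
ConversionPath choosePath(const std::string &Bytes,
                          bool InputIsASCIISuperset) {
  if (InputIsASCIISuperset && isPureASCII(Bytes))
    return ConversionPath::Fast;
  return ConversionPath::Slow;
}
```

With this shape, a pure-ASCII literal costs one linear scan and no
conversion call.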

Actually, something just occurred to me, though: if I recall
correctly, Shift JIS is the default charset on Japanese Windows
systems.  Do we plan to key the default -finput-charset off of the
default system charset?  I'm not sure what, if anything, we can/should
do in that situation.
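
To make the Shift JIS concern concrete, here is a rough sketch of why
a byte-wise scan misfires on it.  The byte 0x5C (backslash in ASCII)
can appear as the second byte of a two-byte character, for example
0x95 0x5C.  The function names are hypothetical, and the lead-byte
ranges are an assumption based on the common Windows variant of the
encoding:

```cpp
#include <cstddef>
#include <string>

// Assumed lead-byte ranges for two-byte Shift JIS characters
// (0x81-0x9F and 0xE0-0xFC, as in the common Windows variant).
bool isShiftJISLeadByte(unsigned char C) {
  return (C >= 0x81 && C <= 0x9F) || (C >= 0xE0 && C <= 0xFC);
}

// Naive scan: counts every 0x5C byte, including trail bytes.
std::size_t countBackslashBytes(const std::string &Bytes) {
  std::size_t N = 0;
  for (unsigned char C : Bytes)
    if (C == 0x5C)
      ++N;
  return N;
}

// Charset-aware scan: skips the trail byte after each lead byte, so
// only standalone 0x5C bytes are counted.
std::size_t countRealBackslashes(const std::string &Bytes) {
  std::size_t N = 0;
  for (std::size_t I = 0; I < Bytes.size(); ++I) {
    unsigned char C = Bytes[I];
    if (isShiftJISLeadByte(C)) {
      ++I; // skip the trail byte of this two-byte character
      continue;
    }
    if (C == 0x5C)
      ++N;
  }
  return N;
}
```

On the two bytes 0x95 0x5C the naive scan reports one backslash and
the charset-aware scan reports none, which is the kind of mismatch
that would trip up escape processing on unconverted input.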

-Eli


