[LLVMdev] [cfe-dev] Unicode path handling on Windows

NAKAMURA Takumi geek4civic at gmail.com
Thu Sep 1 05:12:10 PDT 2011


Guys, welcome to the weird world of i18n!
We Japanese have suffered from multibyte charsets for 20 years.

I have added a comment at http://llvm.org/bugs/show_bug.cgi?id=10348 .
To be clear, I don't think it would be a practical resolution.

FYI, it seems clang can retrieve an MBCS path with s/CP_UTF8/CP_ACP/g.
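
For reference, the kind of conversion that substitution affects looks roughly
like the sketch below. It is not the actual clang code: the helper name
widenFromDefaultCodepage is hypothetical, and the non-Windows branch (mbstowcs
with the current C locale) is only an assumption added so the sketch builds
outside win32.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: widen a path taken from argv using the process's
// default (ANSI) codepage instead of assuming the bytes are UTF-8.
std::wstring widenFromDefaultCodepage(const std::string &bytes) {
  if (bytes.empty())
    return std::wstring();
#ifdef _WIN32
  // First call asks how many wchar_t units the result needs.
  int n = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, bytes.data(),
                              static_cast<int>(bytes.size()), nullptr, 0);
  if (n <= 0)
    return std::wstring(); // not valid in the current codepage
  std::wstring wide(static_cast<std::size_t>(n), L'\0');
  MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS, bytes.data(),
                      static_cast<int>(bytes.size()), &wide[0], n);
  return wide;
#else
  // Assumption: outside win32, the current C locale stands in for the
  // default codepage. There are never more wide chars than input bytes.
  std::wstring wide(bytes.size(), L'\0');
  std::size_t n = std::mbstowcs(&wide[0], bytes.c_str(), wide.size());
  if (n == static_cast<std::size_t>(-1))
    return std::wstring(); // invalid multibyte sequence
  wide.resize(n);
  return wide;
#endif
}
```

With CP_UTF8 in place of CP_ACP, the same call would reject or misread the
CP932 bytes that cmd.exe passes in argv on a Japanese system.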

E>bin\clang.exe -S なかむら\たくみ.c
なかむら\たくみ.c:4:2: error: #error
#error
 ^
1 error generated.

Note, though, that MBCS still has an issue:

E>bin\clang.exe -S 表はダメ文字\表はダメ文字.c
clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
clang: error: no input files

Note that "表" is represented as "0x95 0x5C" in CP932, and 0x5C is also the code for "\".
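
A minimal sketch of why this breaks byte-wise path parsing, assuming the usual
CP932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC). The function names are
hypothetical and this is not what clang does today; on win32 one would
probably ask IsDBCSLeadByteEx instead of hard-coding the ranges.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if b can start a two-byte character in CP932 (Shift JIS).
bool isCP932LeadByte(unsigned char b) {
  return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: every 0x5C byte is taken to be a path separator.
std::vector<std::size_t> naiveSeparators(const std::string &path) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < path.size(); ++i)
    if (path[i] == '\\')
      out.push_back(i);
  return out;
}

// Codepage-aware scan: the trail byte of a two-byte character is skipped,
// so a 0x5C in the 2nd octet is not mistaken for a separator.
std::vector<std::size_t> cp932Separators(const std::string &path) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < path.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(path[i]);
    if (isCP932LeadByte(b) && i + 1 < path.size()) {
      ++i; // consume the trail byte, even when it is 0x5C
      continue;
    }
    if (b == 0x5C)
      out.push_back(i);
  }
  return out;
}
```

For the CP932 bytes of "表\表.c" (95 5C 5C 95 5C 2E 63), the naive scan reports
separators at offsets 1, 2 and 4, while the codepage-aware scan reports only
offset 2.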

In principle, IMHHHO:

  - argv should be treated as a "blackbox" byte stream.
  - Don't assume "wmain(argc, wchar_t **argv)" is available. mingw does
not have one, so argv must be presented in the default codepage.
  - A few codepages (e.g. cp932, Japanese Shift JIS) can contain
0x5C (\) in the 2nd (trailing) octet.

Win32 ANSI (****A) APIs assume the local codepage.

What we should do in llvm:

  - Treat a pathstring in argv as a blackbox. Never parse
(char*)pathstring without knowing its encoding.
  - UTF8 would be useless on win32. Win32 does not handle utf8
implicitly anywhere.
  - The Path API should hold a pathstring in API-native form (bytestream on
unix, UCS2 wchar_t on win32).
  - A path should be manipulated in API-native form as far as possible.
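
To illustrate the last two points, here is a rough sketch of a path type that
stores and manipulates its string in API-native units. All names in it
(NativeChar, NativeString, NativePath, kSeparator) are hypothetical; it is not
the existing llvm Path API, only one way the idea could look.

```cpp
#include <string>

#ifdef _WIN32
typedef wchar_t NativeChar;              // unit used by the Win32 wide APIs
const NativeChar kSeparator = L'\\';
#else
typedef char NativeChar;                 // unix: opaque byte stream
const NativeChar kSeparator = '/';
#endif

typedef std::basic_string<NativeChar> NativeString;

// Hypothetical path holder: never converts to another encoding internally.
class NativePath {
  NativeString Str;

public:
  explicit NativePath(const NativeString &S) : Str(S) {}

  const NativeString &str() const { return Str; }
  const NativeChar *c_str() const { return Str.c_str(); }

  // Append one component, adding a separator only when one is missing.
  NativePath join(const NativeString &Component) const {
    NativeString R = Str;
    if (!R.empty() && R[R.size() - 1] != kSeparator)
      R += kSeparator;
    R += Component;
    return NativePath(R);
  }
};
```

On win32 the separator test compares whole wchar_t units, so the CP932
"0x5C in the 2nd octet" case cannot occur once the path has been widened.
Conversion from argv would then happen once, at the boundary.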

In the future, we might consider "-finput-charset" and "-fexec-charset" for clang.
Please consider a source file:

////////
#include "むすめは/まおちゃん.h"
char const literal[] = "俺です、俺俺";
////////

The include path (#include) should be handled as host-dependent. The
literal should be interpreted with input-charset and be emitted with
exec-charset.

Too hard the life is. Would you like to live in Japan? :p


...Takumi


2011/9/1 Nikola Smiljanic <popizdeh at gmail.com>:
> The function available in clang/lib/Basic/ConvertUTF.c deals with unsigned
> shorts, and I need wchar_t?
>
> On Thu, Sep 1, 2011 at 9:36 AM, Jean-Daniel Dupas <devlists at shadowlab.org>
> wrote:
>>
>> Le 31 août 2011 à 21:02, Aaron Ballman a écrit :
>>
>> > On Wed, Aug 31, 2011 at 1:17 PM, Eli Friedman <eli.friedman at gmail.com>
>> > wrote:
>> >> On Wed, Aug 31, 2011 at 10:58 AM, Nikola Smiljanic <popizdeh at gmail.com>
>> >> wrote:
>> >>> _wopen expects wchar_t* and the only visible function for conversion
>> >>> to utf16 is ConvertUTF8toUTF16 which converts to unsigned shorts.
>> >>
>> >> If you're in #ifdef WIN32 code, just use ConvertUTF8toUTF16 and
>> >> reinterpret_cast from unsigned short* to wchar_t*.
>> >
>> > I think the problem is that PathV2.inc is part of LLVM, and the
>> > ConvertUTF8ToUTF16 function is in an anonymous namespace.  So the
>> > question becomes: raise the function into an accessible namespace,
>> > duplicate code, or find some other mechanism?
>>
>> This function is also available in clang/lib/Basic/ConvertUTF.c
>>
>> >
>> > I don't think it makes sense to raise the function out of the
>> > anonymous namespace unless it's also moved (it has nothing to do with
>> > paths per se).  Perhaps it's worth it to move it to StringRef?
>> >
>> > ~Aaron
>> >
>> > _______________________________________________
>> > cfe-dev mailing list
>> > cfe-dev at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>>
>> -- Jean-Daniel



