[LLVMdev] [cfe-dev] Unicode path handling on Windows

Ruben Van Boxem vanboxem.ruben at gmail.com
Thu Sep 1 08:44:38 PDT 2011


On 1 Sep 2011 14:12, "NAKAMURA Takumi" <geek4civic at gmail.com> wrote:
>
> Guys, welcome to the too weird i18n world!
> We Japanese have suffered with multibyte charsets for 20 years.
>
> I have added a comment in http://llvm.org/bugs/show_bug.cgi?id=10348 .
> Of course, I know it would not be a practical resolution.
>
> FYI, it seems clang can retrieve mbcs path with s/CP_UTF8/CP_ACP/g.
>
> E>bin\clang.exe -S なかむら\たくみ.c
> なかむら\たくみ.c:4:2: error: #error
> #error
>  ^
> 1 error generated.
>
> Note, though, that MBCS still has an issue:
>
> E>bin\clang.exe -S 表はダメ文字\表はダメ文字.c
> clang: error: no such file or directory: '表はダメ文字\表はダメ文字.c'
> clang: error: no input files
>
> Note, "表" is represented as "0x95 0x5C" in CP932.
>
> In principle, IMHHHO;
>
>  - argv should be treated as a "black box" byte stream.
>  - Don't assume "wmain(argc, wchar_t **argv)"; mingw does not have
> one. So argv must be presented in the default codepage.

Correction: I believe MinGW-w64 has a Unicode startup and thus support for
wmain (but of course it would be better to shift this to strict API
functions)

>  - A few codepages (e.g. cp932, Japanese Shift JIS) may contain
> 0x5C (\) in the 2nd (trailing) octet of a character.
>
> Win32 ANSI (****A) APIs assume the local codepage.
>
> What we should do in llvm:
>
>  - Treat the pathstring in argv as a black box. Never parse a
> (char*)pathstring without any knowledge of its encoding.
>  - UTF8 would be useless on win32. Win32 does not handle utf8
> implicitly anywhere.
>  - The Path API should hold the pathstring in API-native form (bytestream on
> unix, UCS2 wchar_t on win32).
>  - Paths should be manipulated in API-native form as far as possible.

Isn't it more straightforward to use utf-8 internally, use the conversion
functions provided by the win32 API whenever calling other win32 API
functions, and always call the wide versions of the win32 functions? That
gives full compatibility and a single encoding internally.
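
A minimal sketch of that approach, under stated assumptions: paths are
kept as UTF-8 in std::string and converted to UTF-16 only at the API
boundary. utf8ToUtf16 and openForRead are hypothetical names, not
LLVM functions. The converter is hand-written here so the sketch is
portable; on Windows, MultiByteToWideChar(CP_UTF8, ...) does the same
job.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Converts UTF-8 to UTF-16. Returns false on malformed input
// (bad lead byte, truncation, overlong form, surrogate, or > U+10FFFF).
bool utf8ToUtf16(const std::string &in, std::u16string &out) {
  static const std::uint32_t minForExtra[] = {0, 0x80, 0x800, 0x10000};
  out.clear();
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char b0 = static_cast<unsigned char>(in[i]);
    std::uint32_t cp;
    std::size_t extra; // number of continuation bytes expected
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else return false; // stray continuation byte or invalid lead byte
    if (extra > in.size() - i - 1)
      return false; // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k) {
      unsigned char b = static_cast<unsigned char>(in[i + k]);
      if ((b & 0xC0) != 0x80)
        return false;
      cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    if (cp < minForExtra[extra] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000; // encode as a surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return true;
}

#ifdef _WIN32
// Hypothetical boundary helper: the caller holds UTF-8, and only the
// wide (W) variant of the Win32 function is ever called.
HANDLE openForRead(const std::string &utf8Path) {
  std::u16string wide;
  if (!utf8ToUtf16(utf8Path, wide))
    return INVALID_HANDLE_VALUE;
  std::wstring w(wide.begin(), wide.end()); // wchar_t is 16-bit on Windows
  return CreateFileW(w.c_str(), GENERIC_READ, FILE_SHARE_READ, nullptr,
                     OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
#endif
```

Note that argv would still have to be obtained in wide form and
converted to UTF-8 first; the sketch only covers the outgoing direction.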

Ruben
>
> In the future, we might consider "-finput-charset" and "-fexec-charset" in
> clang.
> Please consider a source file:
>
> ////////
> #include "むすめは/まおちゃん.h"
> char const literal[] = "俺です、俺俺";
> ////////
>
> The include path (#include) should be handled as host-dependent. The
> literal should be interpreted with input-charset and emitted with
> exec-charset.
>
> Too hard the life is. Would you like to live in Japan? :p
>
>
> ...Takumi
>
>
> 2011/9/1 Nikola Smiljanic <popizdeh at gmail.com>:
> > The function available in clang/lib/Basic/ConvertUTF.c deals with unsigned
> > shorts, and I need wchar_t?
> >
> > On Thu, Sep 1, 2011 at 9:36 AM, Jean-Daniel Dupas <devlists at shadowlab.org>
> > wrote:
> >>
> >> On 31 Aug 2011, at 21:02, Aaron Ballman wrote:
> >>
> >> > On Wed, Aug 31, 2011 at 1:17 PM, Eli Friedman <eli.friedman at gmail.com>
> >> > wrote:
> >> >> On Wed, Aug 31, 2011 at 10:58 AM, Nikola Smiljanic <popizdeh at gmail.com>
> >> >> wrote:
> >> >>> _wopen expects wchar_t* and the only visible function for conversion
> >> >>> to utf16 is ConvertUTF8toUTF16, which converts to unsigned shorts.
> >> >>
> >> >> If you're in #ifdef WIN32 code, just use ConvertUTF8toUTF16 and
> >> >> reinterpret_cast from unsigned short* to wchar_t*.
> >> >
> >> > I think the problem is that PathV2.inc is part of LLVM, and the
> >> > ConvertUTF8ToUTF16 function is in an anonymous namespace.  So the
> >> > question becomes: raise the function into an accessible namespace,
> >> > duplicate code, or find some other mechanism?
> >>
> >> This function is also available in clang/lib/Basic/ConvertUTF.c
> >>
> >> >
> >> > I don't think it makes sense to raise the function out of the
> >> > anonymous namespace unless it's also moved (it has nothing to do with
> >> > paths per se).  Perhaps it's worth it to move it to StringRef?
> >> >
> >> > ~Aaron
> >> >
> >> > _______________________________________________
> >> > cfe-dev mailing list
> >> > cfe-dev at cs.uiuc.edu
> >> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
> >>
> >> -- Jean-Daniel
> >>
> >