[cfe-dev] Unicode path handling on Windows

Seth Cantrell seth.cantrell at gmail.com
Thu Sep 1 14:17:11 PDT 2011


One issue is that filenames on Windows can include Unicode characters that the current code page cannot represent, so the filenames in const char *argv[] (which are encoded in that code page) aren't necessarily usable. One solution is to avoid argv and instead get the command line from the Windows API:

#ifdef _WIN32
#include <windows.h>  // must come before shellapi.h
#include <shellapi.h> // for CommandLineToArgvW
#endif
#include <codecvt>    // for std::codecvt_utf8_utf16
#include <iostream>
#include <locale>     // for std::wstring_convert
#include <string>
#include <vector>

int main() {
#ifdef _WIN32

// get UTF-16 encoded wchar_t arguments

    int argc;
    LPWSTR *szArglist = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (NULL == szArglist) {
        std::cerr << "CommandLineToArgvW failed\n";
        return 1; // nothing to convert
    }

// convert to UTF-8 encoded char arguments (C++11)

    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    std::vector<std::string> args;
    for (int i = 0; i < argc; ++i) {
        args.push_back(convert.to_bytes(szArglist[i]));
    }
    LocalFree(szArglist); // the array is allocated by CommandLineToArgvW
#endif // _WIN32

}
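
For the other direction (the ::open to ::_wopen change described in the quoted message below), here is a rough sketch of converting a UTF-8 path to UTF-16 right at the API boundary. The names utf8_to_utf16 and open_utf8_path are made up for illustration and are not the functions from the patch; the decoder is hand-written only so that the sketch compiles on every platform, and on Windows one could just as well call MultiByteToWideChar with CP_UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#ifdef _WIN32
#include <io.h>     // for _wopen
#endif

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string &in) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("bad UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            assert(cp <= 0xFFFFF);
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Hypothetical wrapper: the caller always passes UTF-8; only the Windows
// branch converts, and it always calls the wide CRT function.
// Intended for opening existing files (no O_CREAT, so no mode argument).
int open_utf8_path(const std::string &path, int flags) {
#ifdef _WIN32
    std::u16string u16 = utf8_to_utf16(path);
    std::wstring wide(u16.begin(), u16.end()); // wchar_t is 16 bits on Windows
    return ::_wopen(wide.c_str(), flags);
#else
    return ::open(path.c_str(), flags);
#endif
}
```

With something like this, a call such as open_utf8_path(args[1], O_RDONLY) would take the UTF-8 strings produced above on every platform, and the encoding conversion stays confined to the one place that talks to the OS.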




On Sep 1, 2011, at 4:17 PM, Nikola Smiljanic wrote:

> AFAIK Clang internals do assume utf8, and llvm::sys::path converts strings to utf16 on windows and calls W API functions.
> 
> I'd appreciate it if somebody would take a look at my changes and comment on them. Here's a brief explanation of what I did:
> 
> - Convert argv to utf8 using current system locale for win32 (this is done as soon as possible inside ExpandArgv). This makes the driver happy since calls to llvm::sys::path::exists succeed.
> - Change calls to ::open (inside FileSystemStatCache and MemoryBuffer) to ::_wopen on win32 by converting the path to utf16.
> - In order to do the conversions I had to expose two functions: one of them was already there but wasn't visible, and I added the other one.
> 
> Known issues:
> 
> - I should probably use LLVM_ON_WIN32 instead of WIN32 but this macro isn't defined inside FileSystemStatCache and MemoryBuffer for some reason. Both of these files have an #ifdef section that deals with O_BINARY so maybe these two sections should be consolidated?
> - Functions convert_multibyte_to_utf8 and convert_utf8_to_utf16 have definitions only on windows so every other platform is currently broken.
> 
> On Thu, Sep 1, 2011 at 5:44 PM, Ruben Van Boxem <vanboxem.ruben at gmail.com> wrote:
> Isn't it more straightforward to use UTF-8 internally, use the conversion functions provided by the Win32 API when calling other Win32 API functions, and always call the wide versions of the Win32 functions? That gives guaranteed full compatibility and a single encoding internally.
> 
> Ruben
> 
> <unicode_path_clang.patch><unicode_path_llvm.patch>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
