[LLVMdev] [cfe-dev] Unicode path handling on Windows

Nikola Smiljanic popizdeh at gmail.com
Mon Oct 3 09:21:32 PDT 2011


towlower doesn't seem to work with my test string in Cyrillic. This function
does locale-specific conversion, is this what we want?

Here's the whole thing, with all the calls to ::stat replaced with
llvm::sys::fs::Stat.

On Fri, Sep 30, 2011 at 8:04 PM, Bryce Cogswell <bryceco at gmail.com> wrote:

> You can use _iswupper and _towlower instead of CharLowerBuffW. They don't
> require windows.h and work with /Za.
>
> On Sep 30, 2011, at 12:24 AM, Nikola Smiljanic wrote:
>
> I tried to do the conversion to lowercase inside GetFullPath by adding an
> additional bool parameter to this function. It's not perfect but seems much
> better than repeating the whole UTF8 to UTF16 and UTF16 to UTF8 conversion
> again. The problem I have is with access to CharLowerBuffW. I need Windows.h
> for this function but when I try to include it I get a bunch of errors
> because Language Extensions are disabled, /Za switch. Do I just enable them
> and include Windows.h inside and #ifdef section?
>
> On Thu, Sep 29, 2011 at 7:57 AM, Bryce Cogswell <bryceco at gmail.com> wrote:
>
>> I agree they are broken on all platforms. However, FileManager.cpp already
>> contains an #if WIN32 conditional around the code calling LowercaseString,
>> so you can use MultiByteToWideChar and CharLowerBuffW directly there, and
>> not call the LowercaseString function. I don't think there are any other
>> places where LowercaseString is called with non-ascii data, so you can punt
>> on fixing it for now.
>>
>> On Sep 28, 2011, at 10:11 PM, Nikola Smiljanic wrote:
>>
>> I have a problem with Lowercase and Uppercase functions. These are broken
>> on all platforms, not only Window, so I can't just #ifdef and use CharLowerBuffW.
>> I need a portable way to convert from UTF8 to UTF16. There is set of
>> functions inside clang/Basic/ConvertUTF, but LLVM can't depend on this. What
>> do I do?
>>
>> On Tue, Sep 27, 2011 at 5:09 AM, Bryce Cogswell <bryceco at yahoo.com>wrote:
>>
>>> I think the assert you have for _stat64i32 is fine. It is a constant
>>> expression so should compile to nothing, and the chance of the definition
>>> changing is pretty much zero.
>>>
>>> LowercaseString appears to be used by ASM printers where the output is
>>> expected to be ASCII, and then some WIN32-conditioned code in
>>> FileManager.cpp. I hate to say it but you'll probably need to convert the
>>> UTF-8 paths to wide char, lower case it using CharLowerBuffW (since it needs
>>> to match the casing rules used by NTFS), and then convert back to UTF-8.
>>>
>>> If you need to be pedantic about recognizing whether two paths are the
>>> same on Windows you also need to call GetFullPathName in order to expand any
>>> 8.3 path components, but this is an expensive function so I wouldn't do it
>>> unless absolutely necessary.
>>>
>>> There is also a call to _fullpath in there that needs to be changed to
>>> _wfullpath.
>>>
>>> The rest of the patch looks good.
>>>
>>> On Sep 23, 2011, at 9:57 AM, Nikola Smiljanic wrote:
>>>
>>> Here's a new patch that fixes all the issues mentioned before. Note that
>>> this isn't final, I didn't want to replace all calls to ::stat so that it's
>>> easier to review.
>>>
>>> I have only one more questions:
>>>
>>> 1. _wopen accepts _stat64i32 instead of stat structure. These two are
>>> exactly the same, the only difference is that stat uses time_t
>>> and _stat64i32 uses __time64_t (time_t is a typedef for __time64_t but this
>>> depends on one more macro). I check to see that size of these two structures
>>> is the same, but is there something more that I could do? I'd also like to
>>> use some kind of static assert instead of runtime assert.
>>>
>>> I've also noticed that function LowercaseString (the same applies to
>>> UppercaseString) doesn't work with UTF8 input. It's calling isupper and
>>> tolower and these functions can't handle UTF8. This should get a new issue
>>> probably?
>>>
>>> On Wed, Sep 21, 2011 at 1:41 AM, Aaron Ballman <aaron at aaronballman.com>wrote:
>>>
>>>> On Tue, Sep 20, 2011 at 4:15 PM, Nikola Smiljanic <popizdeh at gmail.com>
>>>> wrote:
>>>> > OK since this approach makes sense I'll shoot with my questions :)
>>>> > 1. Where should get_utf8_argv go and is the name of this function OK?
>>>> Right
>>>> > now the function is inside llvm::sys::fs namespace because I need
>>>> access to
>>>> > Windows.h, should I leave it there.
>>>>
>>>> I don't think it belongs there as it has nothing to do with the file
>>>> system.  However, I'm not familiar enough to know of a better place
>>>> for it either.  Hopefully someone else can chime in with an idea.
>>>>
>>>> > 2. get_utf8_argv allocates new char** representing utf8 argv, should I
>>>> > deallocate this inside driver.cpp, use std::vector<std::string>
>>>> instead or
>>>> > ignore the allocation completely?
>>>>
>>>> Perhaps some SmallVectors would make sense here?  An inner one to hold
>>>> the arg data, and an outer one to hold the list of inner vectors.
>>>> Then you can do the cleanup within the call.
>>>>
>>>> > 3. There is an #ifdef inside driver.cpp right now that I don't like.
>>>> > get_utf8_argv works for windows only, but I could change the function
>>>> to
>>>> > accept argv and return it as vector of utf8 strings. It would convert
>>>> char*
>>>> > to std::string on unix and use GetCommandLine + utf8 conversion on
>>>> windows.
>>>>
>>>> I have no concrete opinion one way or the other on this.
>>>>
>>>> > 4. Should I wrap functions like ::fstat and ::close for consistency
>>>> even
>>>> > though they don't work with paths?
>>>>
>>>> I don't believe so.
>>>>
>>>> > I'll fix everything that is formatting related. You're right, current
>>>> patch
>>>> > has Open only for windows but I'll add the other one as well.
>>>>
>>>> Thanks for the help!
>>>>
>>>> ~Aaron
>>>>
>>>
>>> <clang.patch><llvm.patch>
>>>
>>>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/52dcf52b/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clang.patch
Type: application/octet-stream
Size: 15664 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/52dcf52b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm.patch
Type: application/octet-stream
Size: 4685 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111003/52dcf52b/attachment-0001.obj>


More information about the llvm-dev mailing list