[cfe-dev] [LLVMdev] Unicode path handling on Windows

Nikola Smiljanic popizdeh at gmail.com
Mon Oct 3 12:19:32 PDT 2011


CharLowerW does the right thing. But I still need Windows.h to use it :)

On Mon, Oct 3, 2011 at 8:43 PM, Bryce Cogswell <bryceco at gmail.com> wrote:

> Locale-specific is not what we want, but I don't believe Windows exposes an
> alternative API that does what we want. (Does CharLower give a different
> answer than tolower?)
>
> However, looking over the FileManager code a little more I'm not even sure
> using the path is the best solution, it seems it would be better to use
> inode like the unix code does. Windows doesn't support inode (the s_ino
> field in stat), but it does have nFileIndexHigh/nFileIndexLow which are
> exposed via ::GetFileInformationByHandle and is basically the same thing.
> This would require refactoring FileSystemStatCache to use a new structure in
> place of stat that could be shared between Windows and Unix. This would be a
> lot of small changes but seems like it would be fairly straightforward.
>
>
> On Oct 3, 2011, at 9:21 AM, Nikola Smiljanic wrote:
>
> towlower doesn't seem to work with my test string in Cyrillic. This
> function does locale-specific conversion, is this what we want?
>
> Here's the whole thing, with all the calls to ::stat replaced with
> llvm::sys::fs::Stat.
>
> On Fri, Sep 30, 2011 at 8:04 PM, Bryce Cogswell <bryceco at gmail.com> wrote:
>
>> You can use _iswupper and _towlower instead of CharLowerBuffW. They don't
>> require windows.h and work with /Za.
>>
>> On Sep 30, 2011, at 12:24 AM, Nikola Smiljanic wrote:
>>
>> I tried to do the conversion to lowercase inside GetFullPath by adding an
>> additional bool parameter to this function. It's not perfect but seems much
>> better than repeating the whole UTF8 to UTF16 and UTF16 to UTF8 conversion
>> again. The problem I have is with access to CharLowerBuffW. I need Windows.h
>> for this function but when I try to include it I get a bunch of errors
>> because Language Extensions are disabled, /Za switch. Do I just enable them
>> and include Windows.h inside and #ifdef section?
>>
>> On Thu, Sep 29, 2011 at 7:57 AM, Bryce Cogswell <bryceco at gmail.com>wrote:
>>
>>> I agree they are broken on all platforms. However, FileManager.cpp
>>> already contains an #if WIN32 conditional around the code calling
>>> LowercaseString, so you can use MultiByteToWideChar and CharLowerBuffW
>>> directly there, and not call the LowercaseString function. I don't think
>>> there are any other places where LowercaseString is called with non-ascii
>>> data, so you can punt on fixing it for now.
>>>
>>> On Sep 28, 2011, at 10:11 PM, Nikola Smiljanic wrote:
>>>
>>> I have a problem with Lowercase and Uppercase functions. These are broken
>>> on all platforms, not only Window, so I can't just #ifdef and use CharLowerBuffW.
>>> I need a portable way to convert from UTF8 to UTF16. There is set of
>>> functions inside clang/Basic/ConvertUTF, but LLVM can't depend on this. What
>>> do I do?
>>>
>>> On Tue, Sep 27, 2011 at 5:09 AM, Bryce Cogswell <bryceco at yahoo.com>wrote:
>>>
>>>> I think the assert you have for _stat64i32 is fine. It is a constant
>>>> expression so should compile to nothing, and the chance of the definition
>>>> changing is pretty much zero.
>>>>
>>>> LowercaseString appears to be used by ASM printers where the output is
>>>> expected to be ASCII, and then some WIN32-conditioned code in
>>>> FileManager.cpp. I hate to say it but you'll probably need to convert the
>>>> UTF-8 paths to wide char, lower case it using CharLowerBuffW (since it needs
>>>> to match the casing rules used by NTFS), and then convert back to UTF-8.
>>>>
>>>> If you need to be pedantic about recognizing whether two paths are the
>>>> same on Windows you also need to call GetFullPathName in order to expand any
>>>> 8.3 path components, but this is an expensive function so I wouldn't do it
>>>> unless absolutely necessary.
>>>>
>>>> There is also a call to _fullpath in there that needs to be changed to
>>>> _wfullpath.
>>>>
>>>> The rest of the patch looks good.
>>>>
>>>> On Sep 23, 2011, at 9:57 AM, Nikola Smiljanic wrote:
>>>>
>>>> Here's a new patch that fixes all the issues mentioned before. Note that
>>>> this isn't final, I didn't want to replace all calls to ::stat so that it's
>>>> easier to review.
>>>>
>>>> I have only one more questions:
>>>>
>>>> 1. _wopen accepts _stat64i32 instead of stat structure. These two are
>>>> exactly the same, the only difference is that stat uses time_t
>>>> and _stat64i32 uses __time64_t (time_t is a typedef for __time64_t but this
>>>> depends on one more macro). I check to see that size of these two structures
>>>> is the same, but is there something more that I could do? I'd also like to
>>>> use some kind of static assert instead of runtime assert.
>>>>
>>>> I've also noticed that function LowercaseString (the same applies to
>>>> UppercaseString) doesn't work with UTF8 input. It's calling isupper and
>>>> tolower and these functions can't handle UTF8. This should get a new issue
>>>> probably?
>>>>
>>>> On Wed, Sep 21, 2011 at 1:41 AM, Aaron Ballman <aaron at aaronballman.com>wrote:
>>>>
>>>>> On Tue, Sep 20, 2011 at 4:15 PM, Nikola Smiljanic <popizdeh at gmail.com>
>>>>> wrote:
>>>>> > OK since this approach makes sense I'll shoot with my questions :)
>>>>> > 1. Where should get_utf8_argv go and is the name of this function OK?
>>>>> Right
>>>>> > now the function is inside llvm::sys::fs namespace because I need
>>>>> access to
>>>>> > Windows.h, should I leave it there.
>>>>>
>>>>> I don't think it belongs there as it has nothing to do with the file
>>>>> system.  However, I'm not familiar enough to know of a better place
>>>>> for it either.  Hopefully someone else can chime in with an idea.
>>>>>
>>>>> > 2. get_utf8_argv allocates new char** representing utf8 argv, should
>>>>> I
>>>>> > deallocate this inside driver.cpp, use std::vector<std::string>
>>>>> instead or
>>>>> > ignore the allocation completely?
>>>>>
>>>>> Perhaps some SmallVectors would make sense here?  An inner one to hold
>>>>> the arg data, and an outer one to hold the list of inner vectors.
>>>>> Then you can do the cleanup within the call.
>>>>>
>>>>> > 3. There is an #ifdef inside driver.cpp right now that I don't like.
>>>>> > get_utf8_argv works for windows only, but I could change the function
>>>>> to
>>>>> > accept argv and return it as vector of utf8 strings. It would convert
>>>>> char*
>>>>> > to std::string on unix and use GetCommandLine + utf8 conversion on
>>>>> windows.
>>>>>
>>>>> I have no concrete opinion one way or the other on this.
>>>>>
>>>>> > 4. Should I wrap functions like ::fstat and ::close for consistency
>>>>> even
>>>>> > though they don't work with paths?
>>>>>
>>>>> I don't believe so.
>>>>>
>>>>> > I'll fix everything that is formatting related. You're right, current
>>>>> patch
>>>>> > has Open only for windows but I'll add the other one as well.
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>> ~Aaron
>>>>>
>>>>
>>>> <clang.patch><llvm.patch>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> <clang.patch><llvm.patch>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111003/0636cfb8/attachment.html>


More information about the cfe-dev mailing list