[LLVMdev] [cfe-dev] Unicode path handling on Windows

Török Edwin edwintorok at gmail.com
Mon Oct 3 14:18:03 PDT 2011


On 10/03/2011 11:59 PM, Joachim Durchholz wrote:
> On 03.10.2011 22:12, Nikola Smiljanic wrote:
>> How about this:
>>
>> for (int i = 0; i != NumWChars; ++i)
>>     absPath[i] = std::tolower(absPath[i], std::locale());
>>
>> seems to be working just fine?
> 
> You have two assumptions here:
> 
> Assumption 1: For each lowercase character, there is an equivalent 
> uppercase character, and vice versa.
> This is not true in half a dozen languages according to
> ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt .
> 
> Assumption 2: The transformation from lower case to upper case can be 
> done for each character individually, without considering context.
> This is not true in a couple of languages according to SpecialCasing.txt.
> 
> Do not do that. If you get complaints, they will be about scripts that 
> you can't type on your keyboard, and that you know nothing about so you 
> don't even know what the right behaviour would have been.
> Rely on the relevant Unicode library. Which one that would be, and which 
> functions to call, depends on what you need that to-lowercase 
> transformation for. (It also depends on whether the names you get are 
> already normalized or not; I'd want to run a normalization pass on the 
> names first just to be on the safe side.)
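
To make the two assumptions above concrete, here is a minimal sketch in C++. It is not a real case-mapping implementation: lowerSketch and isLetterSketch are hypothetical names, the table covers only ASCII, basic Greek capitals, and two entries I am taking from SpecialCasing.txt (U+0130 and the final-sigma rule), and the word-boundary test is a crude simplification of the Final_Sigma condition. It only illustrates why a per-character loop with a wchar_t-in, wchar_t-out function cannot express these mappings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for this sketch: recognizes only ASCII and basic
// Greek letters, far narrower than Unicode's notion of a cased letter.
static bool isLetterSketch(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= 0x0391 && c <= 0x03C9);
}

// Hypothetical, deliberately incomplete lowercasing of UTF-32 text.
std::u32string lowerSketch(const std::u32string &in) {
    std::u32string out;
    for (std::size_t i = 0; i != in.size(); ++i) {
        const char32_t c = in[i];
        if (c == 0x0130) {
            // Assumption 1 fails: U+0130 lowercases to two code points,
            // U+0069 U+0307, so the output can be longer than the input.
            out += static_cast<char32_t>(0x0069);
            out += static_cast<char32_t>(0x0307);
        } else if (c == 0x03A3) {
            // Assumption 2 fails: capital sigma becomes U+03C2 at the end
            // of a word and U+03C3 elsewhere, so the neighbours matter.
            const bool afterLetter = i > 0 && isLetterSketch(in[i - 1]);
            const bool beforeLetter =
                i + 1 < in.size() && isLetterSketch(in[i + 1]);
            out += static_cast<char32_t>(
                afterLetter && !beforeLetter ? 0x03C2 : 0x03C3);
        } else if ((c >= U'A' && c <= U'Z') ||
                   (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2)) {
            // Simple one-to-one mappings: the only case the loop handles.
            out += static_cast<char32_t>(c + 0x20);
        } else {
            out += c;
        }
    }
    return out;
}
```

Note that the function has to take and return whole strings. Anything real should come from a Unicode library rather than a hand-written table like this one.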

Does Windows do proper Unicode to-lowercase, or does it just lowercase A-Z?

From reading the article below, I gather that you can create file names that
would be considered identical under Unicode to-lowercase rules, yet exist as
different files:
https://blogs.msdn.com/b/michkap/archive/2005/10/17/481600.aspx

Best regards,
--Edwin


