[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Wed Jan 18 04:34:12 PST 2012

Okay, I've committed my patch: r148389, r148390, r148391, r148392.

On Jan 17, 2012, at 5:34 PM, Eli Friedman wrote:

> Sorry about not catching the following earlier:
> 
> +  // C++03 does not restrict surrogate codepoints
> +  if (Features.CPlusPlus && !Features.CPlusPlus0x)
> +    invalid_ucn = (0xD800<=UcnVal && UcnVal<=0xDFFF);
> 
> Even if the standard doesn't explicitly disallow it, I don't think it
> makes sense to allow \uD800; we don't ever want to output invalid
> UTF-16.
> 
> Otherwise, looks good; please commit.  (See
> http://llvm.org/docs/DeveloperPolicy.html for how to get commit
> access.)
> 
> -Eli
> 
> 
> On Tue, Jan 17, 2012 at 2:27 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> Do these patches need anything more done before the changes can be committed?
>> 
>> - Seth
>> 
>> On Jan 12, 2012, at 7:57 PM, Seth Cantrell wrote:
>> 
>>> There shouldn't be, since the hex escape processing limits values to what can be held given CharWidth. Removing that check means I can also remove the duplicate logic for calculating available_bits. Here's a new patch 0001 that does that. I also made a small change in patch 0004.
>>> 
>>> make test passes with these on top of commit b030b1949f "Revert accidental commit"
>>> 
>>> On Jan 12, 2012, at 7:22 PM, Eli Friedman wrote:
>>> 
>>>> On Wed, Jan 11, 2012 at 8:35 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>>> Alright, characters for which the appropriate encoding can't be represented as a single value of the appropriate type are now disallowed in character literals.
>>>>> 
>>>>> So now '\u2031' is not allowed (not even in C, where the literal has type int, which could represent the value), and L'\U00010000' is not allowed. Also, replacing these UCNs with the actual characters results in exactly the same behavior.
>>>> 
>>>> Okay, that works.
>>>> 
>>>> +  if (!HadError && (multi_char_too_long || available_bits < needed_bits)) {
>>>> +    PP.Diag(Loc,diag::warn_char_constant_too_large);
>>>> 
>>>> Are there actually any cases where "available_bits < needed_bits" is
>>>> true in the current version of your patch?
>>>> 
>>>> -Eli
>>>> 
>>>>> 
>>>>> On Jan 10, 2012, at 3:59 PM, Eli Friedman wrote:
>>>>> 
>>>>>> On Tue, Jan 10, 2012 at 4:05 AM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>>>>> whoops, that should be "anything that indicates '\U0010FFFD' isn't perfectly valid"
>>>>>>> 
>>>>>>> Accepting larger Unicode escapes is not new with this patch. I tried the clang installed with Xcode 4.2, Apple clang version 3.0 (tags/Apple/clang-211.12) (based on LLVM 3.0svn), and `int i = '\U001F306';` gives i the value 0x001F306. Although I don't have a use case, my preference is to allow the larger Unicode escapes.
>>>>>>> 
>>>>>>> If you want them excluded just let me know the ranges.
>>>>>> 
>>>>>> Accepting it and doing something different from gcc seems likely to
>>>>>> cause issues if someone is accidentally depending on gcc's behavior.
>>>>>> I think we should either reject it or do the same thing as gcc.
>>>>>> 
>>>>>> -Eli
>>> <0001-Improves-support-for-Unicode-in-character-literals.patch><0002-Fix-char-literal-types-in-C.patch><0003-stop-claiming-unicode-escape-sequences-are-too-long-.patch><0004-Add-and-update-tests-for-character-literals.patch>
>>
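
For readers following the thread, the two rules discussed above (reject surrogate code points, and reject a character whose encoding does not fit in a single code unit of the literal's type) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the committed patches: the function names and the unitBits parameter are hypothetical, and the code-unit widths are assumed to be 8, 16, or 32 bits for UTF-8, UTF-16, and UTF-32 respectively.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; these are not Clang's
// actual functions.

// True for code points in the UTF-16 surrogate range, which cannot be
// encoded as valid UTF-16 on their own.
bool isSurrogate(std::uint32_t cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// True for Unicode scalar values: at most 0x10FFFF and not a surrogate.
bool isValidScalar(std::uint32_t cp) {
  return cp <= 0x10FFFF && !isSurrogate(cp);
}

// True if cp encodes as exactly one code unit of the given width.
// unitBits is assumed to be 8 (UTF-8), 16 (UTF-16), or 32 (UTF-32);
// any other width is rejected.
bool fitsInOneCodeUnit(std::uint32_t cp, unsigned unitBits) {
  if (!isValidScalar(cp))
    return false;
  switch (unitBits) {
  case 8:
    return cp <= 0x7F;   // only ASCII is a single UTF-8 code unit
  case 16:
    return cp <= 0xFFFF; // anything above needs a surrogate pair
  case 32:
    return true;         // every scalar value is one UTF-32 code unit
  default:
    return false;
  }
}
```

Under these assumptions, U+2031 does not fit in one 8-bit code unit, and U+10000 does not fit in one 16-bit code unit, which matches the cases described in the thread for a platform where the wide character type is 16 bits.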