[PATCH] D106215: Make wide multi-character character literals ill-formed in C++ mode

Fri Aug 13 06:38:55 PDT 2021

cor3ntin marked an inline comment as done.
cor3ntin added a comment.

In D106215#2943653 <https://reviews.llvm.org/D106215#2943653>, @aaron.ballman wrote:

> In D106215#2943631 <https://reviews.llvm.org/D106215#2943631>, @cor3ntin wrote:
>
>> In D106215#2943611 <https://reviews.llvm.org/D106215#2943611>, @aaron.ballman wrote:
>>
>>> I think that C and C++ should behave the same here; at least, I don't see any reason why they should have different capabilities.
>>
>> I agree, but as WG14 hasn't weighed in, I didn't want to make that call.
>> What do you think?
>
> My reading of C2x is that this is implementation-defined there as well.
>
> 6.4.4.4p13:
>
> A wide character constant prefixed by the letter L has type wchar_t, an integer type defined in the
> <stddef.h> header; a wide character constant prefixed by the letter u or U has type char16_t or
> char32_t, respectively, unsigned integer types defined in the <uchar.h> header. The value of a
> wide character constant containing a single multibyte character that maps to a single member of the
> extended execution character set is the wide character corresponding to that multibyte character,
> as defined by the mbtowc, mbrtoc16, or mbrtoc32 function as appropriate for its type, with an
> implementation-defined current locale. The value of a wide character constant containing more
> than one multibyte character or a single multibyte character that maps to multiple members of
> the extended execution character set, or containing a multibyte character or escape sequence not
> represented in the extended execution character set, is implementation-defined.
>
> Do you agree?

Yes, I agree.
I think Clang could make it ill-formed if it wanted to!
If we want to do that, we could probably remove some more code :)

>>> The paper said that there is no expected code breakage from this change, but have you tried building a diverse corpus of code (like a distro's worth of packages) under this patch to see if anything actually breaks in practice? (I don't expect breakage that isn't identifying an actual issue in the code, but having some verification would be appreciated.) This would also help to identify whether the change is appropriate for C as well.
>>
>> We have done regexes over various repositories (every vcpkg package) with no match. Not running a complete compiler
>
> Regexes are a good start but they miss the goofy (and sometimes awful) stuff that people do with token pasting, line continuations, and other random tricks. Would you be willing to try this as an experiment, or am I asking too much? :-) My thinking is that if we don't see any breakage from compiling a diverse corpus of code, we've done enough due diligence to suggest this is safe for both C and C++, but if we see some breakage, we can either identify that there's some valid use for this that we've not considered (less likely) and would be informative for both WG21 and WG14, or we can identify that we helped find bugs in real world code (more likely) which is also good feedback for the committees.

Unless there is a script to do that easily, I'm not sure I'll be able to get to it any time soon.
But really, there is zero use for these things! And there isn't much room for goofiness: you can certainly write `L ## 'ab'`, but that wouldn't be very useful either.
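
To make the token-pasting case concrete, here is a minimal sketch of how a wide character literal can be formed without its spelling ever appearing in the source text. The names `WIDEN` and `pasted_wide_a` are hypothetical and exist only for this illustration. The multi-character form is left commented out, on the assumption that a compiler with this patch applied rejects it in C++ mode.

```cpp
#include <type_traits>

// Hypothetical macro, for illustration only: pastes the L prefix onto a
// character-literal token during preprocessing.
#define WIDEN(lit) L##lit

// WIDEN('a') becomes the single token L'a', an ordinary wide character literal.
constexpr wchar_t pasted_wide_a() { return WIDEN('a'); }

static_assert(std::is_same<decltype(WIDEN('a')), wchar_t>::value,
              "pasting L onto a character literal yields a wchar_t literal");
static_assert(pasted_wide_a() == L'a',
              "the pasted literal has the same value as the literal written directly");

// WIDEN('ab') would likewise become L'ab', a wide multi-character literal.
// The text L'ab' never appears in the file, so a textual search for that
// spelling finds nothing; only the preprocessed token stream contains it.
// Left commented out because it is the construct this patch diagnoses:
// constexpr wchar_t pasted_wide_ab = WIDEN('ab');
```

This is the kind of construct a regex over source files would not match, which is why compiling a corpus gives stronger evidence than searching it.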

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106215/new/

https://reviews.llvm.org/D106215