[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Fri Aug 13 10:42:27 PDT 2021

aaron.ballman added a comment.

In D106577#2943837 <https://reviews.llvm.org/D106577#2943837>, @jyknight wrote:

> In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:
>
>>> One specific example I'd like to be considered:
>>> Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?
>>
>> And David Keaton (long-time WG14 member and current convener) replied:
>>
>>> Yikes!  It does indeed sound like the library would affect the value of `__STDC_ISO_10646__` in that case.  Thanks for clarifying the details.
>>
>> There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.
>
> I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.

Isn't that scenario basically the one we're in today where the compiler is unaware of what mappings the library provides?

> And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies 24 characters mappings to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.
>
> I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!

Does that imply that an implementation supporting that encoding can't define __STDC_ISO_10646__ because it doesn't meet the "has the same value as the short identifier" requirement?

@jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577