[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Fri Aug 13 08:09:04 PDT 2021

jyknight added a comment.

In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:

>> One specific example I'd like to be considered:
>> Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?
>
> And David Keaton (long-time WG14 member and current convener) replied:
>
>> Yikes!  It does indeed sound like the library would affect the value of `__STDC_ISO_10646__` in that case.  Thanks for clarifying the details.
>
> There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.

I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.

And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies 24 characters mappings to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.

I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577