[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Fri Aug 20 22:28:12 PDT 2021

jyknight added a comment.

In D106577#2944086 <https://reviews.llvm.org/D106577#2944086>, @aaron.ballman wrote:

>> I don't think that scenario is valid. MBCS-to-unicode mappings are a part of the definition of the MBCS (sometimes officially, sometimes de-facto defined by major vendors), not in the definition of Unicode.
>
> Isn't that scenario basically the one we're in today where the compiler is unaware of what mappings the library provides?

What I mean is: Unicode does not define the mapping from a legacy MBCS byte sequence to a Unicode character. That is simply out of scope. The Unicode standard defines only 3 encodings (UTF-8, UTF-16, UTF-32). Mappings for other encodings are defined either by their own standard, or else chosen arbitrarily by a vendor.

>> And in fact, we have a real-life example of this: the GB18030 encoding. That standard specifies 24 characters mappings to private-use-area unicode codepoints in the most recent version, GB18030-2005. (Which is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet, a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all of those 24 characters mapped to PUAs in GB18030-2005 were actually assigned official unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps those to PUA code-points. The only way that can change is if GB18030 gets updated.
>>
>> I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table, and to decode those 24 codepoints to the newly-defined unicode characters, instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!
>
> Does that imply that an implementation supporting that encoding can't define __STDC_ISO_10646__ because it doesn't meet the "has the same value as the short identifier" requirement?

No. The fact that the GB18030 encoding has an unfortunate mapping of its bytes to Unicode characters does not change anything about `__STDC_ISO_10646__`. It does not affect "every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character" at all. All we're talking about here is differences of opinion between implementations as to which Unicode character a given GB18030 byte sequence should be translated to -- not the way in which a Unicode character is stored in a wchar_t.
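To make the distinction concrete, here is a minimal sketch of what the macro does promise. The helper names (`wchar_holds_code_point`, `advertises_iso_10646`, `check_wchar_storage`) are made up for illustration and are not part of the patch or of any library. Nothing in it involves decoding a multibyte string, which is the point: the guarantee concerns only the value stored in a wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if wc holds exactly the code point cp,
 * i.e. the character's ISO 10646 short identifier. */
static int wchar_holds_code_point(wchar_t wc, unsigned long cp) {
    return (unsigned long)wc == cp;
}

/* Hypothetical helper: returns 1 if the implementation advertises
 * __STDC_ISO_10646__, 0 otherwise. */
static int advertises_iso_10646(void) {
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical self-check: only when the macro is defined do we rely on
 * wide character values being equal to their code points. */
static void check_wchar_storage(void) {
    if (advertises_iso_10646()) {
        assert(wchar_holds_code_point(L'\u00E9', 0xE9UL));   /* U+00E9 */
        assert(wchar_holds_code_point(L'\u20AC', 0x20ACUL)); /* U+20AC */
    }
}
```

Which code point a conversion function such as mbstowcs() produces for a given GB18030 byte sequence is a separate question, and this check says nothing about it.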

> @jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?

I'm not. I'd be happy to have you convey my viewpoints.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577