<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 13, 2021 at 7:42 PM Aaron Ballman via Phabricator <<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">aaron.ballman added a comment.<br>

<br>

In D106577#2943837 <<a href="https://reviews.llvm.org/D106577#2943837" rel="noreferrer" target="_blank">https://reviews.llvm.org/D106577#2943837</a>>, @jyknight wrote:<br>

<br>

> In D106577#2904960 <<a href="https://reviews.llvm.org/D106577#2904960" rel="noreferrer" target="_blank">https://reviews.llvm.org/D106577#2904960</a>>, @rsmith wrote:<br>

><br>

>>> One specific example I'd like to be considered:<br>

>>> Suppose the C standard library implementation's mbstowcs converts a certain multi-byte character C to somewhere in the Unicode private use area, because Unicode version N doesn't have a corresponding character. Suppose further that the compiler is aware of Unicode version N+1, in which a character corresponding to C was added. Is an implementation formed by that combination of compiler and standard library, that defines `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming because it represents character C as something other than the corresponding short name from Unicode version N+1?<br>

>><br>

>> And David Keaton (long-time WG14 member and current convener) replied:<br>

>><br>

>>> Yikes!  It does indeed sound like the library would affect the value of `__STDC_ISO_10646__` in that case.  Thanks for clarifying the details.<br>

>><br>

>> There was no further discussion after that point, so I think the unofficial WG14 stance is that the compiler and the library need to collude on setting the value of that macro.<br>

><br>

> I don't think that scenario is valid. MBCS-to-Unicode mappings are part of the definition of the MBCS (sometimes officially, sometimes de facto defined by major vendors), not part of the definition of Unicode.<br>

<br>

Isn't that scenario basically the one we're in today where the compiler is unaware of what mappings the library provides?<br>

<br>

> And in fact, we have a real-life example of this: the GB18030 encoding. In its most recent version, GB18030-2005, that standard specifies 24 character mappings to private-use-area Unicode codepoints. (That is down from 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.) Yet a new version of Unicode coming out will not affect that. Rather, I should say, DID NOT affect that -- all 24 of the characters mapped to the PUA in GB18030-2005 had actually been assigned official Unicode codepoints by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter -- GB18030 still maps them to PUA codepoints. The only way that can change is if GB18030 gets updated.<br>

><br>

> I do note that some implementations (e.g. glibc) have taken it upon themselves to modify the official GB18030 character mapping table and to decode those 24 codepoints to the newly defined Unicode characters instead of the specified PUA codepoints. But there's no way that can be described as a requirement -- it's not even technically correct!<br>

<br>

Does that imply that an implementation supporting that encoding can't define __STDC_ISO_10646__ because it doesn't meet the "has the same value as the short identifier" requirement?<br></blockquote><div><br></div><div>FYI, a revision of GB18030 is expected this year that will no longer use the PUA.</div><div>In general, the PUA is considered "not for interchange", so a system that interprets PUA codepoints differently at different points in time is outside any guarantees provided by Unicode.</div><div>GB18030-2005 is an odd exception: in general, the standard library should never transcode to the PUA, because doing so is not portable.<br></div><div><br></div><div>Despite having a 1-1 mapping to Unicode, GB18030 has to be considered a character set distinct from Unicode. As such, a system where wide literals are GB18030-encoded should not define __STDC_ISO_10646__.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

@jyknight, are you on the WG14 reflectors btw? Would you like to carry on with this discussion over there (or would you like me to convey your viewpoints on your behalf)?<br>

<br>

<br>

Repository:<br>

  rG LLVM Github Monorepo<br>

<br>

<a href="https://reviews.llvm.org/D106577" rel="noreferrer" target="_blank">https://reviews.llvm.org/D106577</a><br>

<br>

</blockquote></div></div>