[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Corentin via cfe-commits cfe-commits at lists.llvm.org
Fri Aug 13 11:05:03 PDT 2021


On Fri, Aug 13, 2021 at 7:42 PM Aaron Ballman via Phabricator <
reviews at reviews.llvm.org> wrote:

> aaron.ballman added a comment.
>
> In D106577#2943837 <https://reviews.llvm.org/D106577#2943837>, @jyknight
> wrote:
>
> > In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>,
> > @rsmith wrote:
> >
> >>> One specific example I'd like to be considered:
> >>> Suppose the C standard library implementation's mbstowcs converts a
> >>> certain multi-byte character C to somewhere in the Unicode private use
> >>> area, because Unicode version N doesn't have a corresponding character.
> >>> Suppose further that the compiler is aware of Unicode version N+1, in
> >>> which a character corresponding to C was added. Is an implementation
> >>> formed by that combination of compiler and standard library, that
> >>> defines `__STDC_ISO_10646__` to N+1, conforming? Or is it non-conforming
> >>> because it represents character C as something other than the
> >>> corresponding short name from Unicode version N+1?
> >>
> >> And David Keaton (long-time WG14 member and current convener) replied:
> >>
> >>> Yikes!  It does indeed sound like the library would affect the value
> >>> of `__STDC_ISO_10646__` in that case.  Thanks for clarifying the details.
> >>
> >> There was no further discussion after that point, so I think the
> >> unofficial WG14 stance is that the compiler and the library need to
> >> collude on setting the value of that macro.
> >
> > I don't think that scenario is valid. MBCS-to-unicode mappings are a
> > part of the definition of the MBCS (sometimes officially, sometimes
> > de-facto defined by major vendors), not in the definition of Unicode.
>
> Isn't that scenario basically the one we're in today where the compiler is
> unaware of what mappings the library provides?
>
> > And in fact, we have a real-life example of this: the GB18030 encoding.
> > That standard specifies 24 character mappings to private-use-area unicode
> > codepoints in the most recent version, GB18030-2005. (Which is down from
> > 80 PUA mappings in its predecessor encoding GBK, and 25 in GB18030-2000.)
> > Yet, a new version of Unicode coming out will not affect that. Rather, I
> > should say, DID NOT affect that -- all of those 24 characters mapped to
> > PUAs in GB18030-2005 were actually assigned official unicode codepoints
> > by 2005 (some in Unicode 3.1, some in Unicode 4.1). But no matter --
> > GB18030 still maps those to PUA code-points. The only way that can change
> > is if GB18030 gets updated.
> >
> > I do note that some implementations (e.g. glibc) have taken it upon
> > themselves to modify the official GB18030 character mapping table, and to
> > decode those 24 codepoints to the newly-defined unicode characters,
> > instead of the specified PUA codepoints. But there's no way that can be
> > described as a requirement -- it's not even technically correct!
>
> Does that imply that an implementation supporting that encoding can't
> define __STDC_ISO_10646__ because it doesn't meet the "has the same value
> as the short identifier" requirement?
>

FYI, there should be a revision of GB18030 this year that will no longer use
the PUA.
In general, the PUA is considered "not for interchange", so if you have a
system that interprets PUA code points differently at different points in
time, you are outside of any guarantees provided by Unicode.
GB18030-2005 is a weird exception: in general, the standard library should
never transcode to the PUA, as this is not portable.
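
To make the PUA point concrete, here is a minimal sketch of a check for
whether a code point falls in one of the three private-use ranges defined by
Unicode. The name is_private_use is hypothetical and not part of any
standard library; it only illustrates the ranges being discussed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a standard library function): returns true if
   cp lies in one of the three Unicode Private Use Areas. */
bool is_private_use(uint32_t cp) {
    return (cp >= 0xE000u && cp <= 0xF8FFu)       /* BMP Private Use Area */
        || (cp >= 0xF0000u && cp <= 0xFFFFDu)     /* Plane 15 (PUA-A)     */
        || (cp >= 0x100000u && cp <= 0x10FFFDu);  /* Plane 16 (PUA-B)     */
}
```

A transcoder could use such a check to flag output that lands in the PUA,
since the meaning of those code points is private to whoever assigned them.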

GB18030, despite having a 1-1 mapping to Unicode, has to be considered a
distinct character set from Unicode. As such, a system where wide literals
are GB18030-encoded should not define __STDC_ISO_10646__.
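
For reference, a small sketch of how user code can observe the macro. The
function name stdc_iso_10646_value is an assumption made up for this
example; the only standard piece is the macro itself, which, when defined,
expands to an integer constant of the form yyyymmL.

```c
#include <wchar.h>

/* Hypothetical helper: returns the value of __STDC_ISO_10646__ (of the
   form yyyymmL) when the implementation defines it, or 0 when it does
   not, in which case wchar_t values are not promised to be ISO/IEC 10646
   code points. */
long stdc_iso_10646_value(void) {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

Code that relies on wchar_t holding Unicode code points can test for a
nonzero result (or use the #ifdef directly) before making that assumption.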


>
> @jyknight, are you on the WG14 reflectors btw? Would you like to carry on
> with this discussion over there (or would you like me to convey your
> viewpoints on your behalf)?
>
>
> Repository:
>   rG LLVM Github Monorepo
>
> CHANGES SINCE LAST ACTION
>   https://reviews.llvm.org/D106577/new/
>
> https://reviews.llvm.org/D106577
>
>