[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Tue Jul 27 08:01:51 PDT 2021

jyknight added a comment.

BTW, looks like the standard wording came from:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_273.htm

which indeed seems to suggest that the intent was to:

1. ensure that WCHAR_MAX is at least the maximum character actually defined so far by the standard (which in past versions was 0xffff, and in current versions is 0x10ffff).
2. ensure that each of those characters defined by the standard, when stored in a wchar_t, has the same numeric value as the standard specifies.
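
As a rough illustration of those two points, here is a minimal sketch in C. The function names are hypothetical (they come from neither the DR nor the patch), and the second check only samples a single character, U+20AC, rather than proving the property for the whole required set.

```c
#include <wchar.h>   /* wchar_t */
#include <stdint.h>  /* WCHAR_MAX */

/* Point 1: wchar_t can represent every code point up to 0x10ffff.
 * (Hypothetical helper name; returns 1 if the range fits, else 0.) */
static int wchar_covers_unicode(void) {
    return (unsigned long long)WCHAR_MAX >= 0x10FFFFULL;
}

/* Point 2: a character stored in a wchar_t has the same numeric value
 * as its code point. Only one sample character is checked here. */
static int literal_matches_code_point(void) {
    wchar_t euro = L'\u20AC';   /* EURO SIGN, U+20AC */
    return euro == 0x20AC;
}
```

On a platform with a 16-bit wchar_t the first check would be expected to return 0, which is the situation where defining the macro for current Unicode versions would be questionable.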

In D106577#2906542 <https://reviews.llvm.org/D106577#2906542>, @rsmith wrote:

> The "old libc" case is for old versions of glibc that put the macro in `features.h` instead of in `stdc-predef.h`. The macros in `stdc-predef.h` aren't a problem until / unless we start auto-including that header.

The `features.h` header in every version of glibc since `stdc-predef.h` was split off has had `#include <stdc-predef.h>` in it. If the redefinition is a problem, it's still a problem in current versions.
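
For illustration, this is roughly how a program would observe the macro after including any libc header. It is a sketch under the assumption that, on glibc, `<wchar.h>` ends up including `features.h`; the helper name is hypothetical.

```c
#include <wchar.h>  /* any libc header; assumed to pull in features.h on glibc */

/* Returns the value of __STDC_ISO_10646__ (an integer of the form yyyymmL)
 * if some part of the implementation defined it, or 0 if nobody did. */
static long iso_10646_version(void) {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

If the compiler also predefined the macro with a different value, the `#define` reached through the libc header is where a redefinition diagnostic could appear.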

> In D106577#2905755 <https://reviews.llvm.org/D106577#2905755>, @jyknight wrote:
>
>> In D106577#2904960 <https://reviews.llvm.org/D106577#2904960>, @rsmith wrote:
>>
>>> One benefit we don't get with this approach is providing the right value for the macro (without paying the cost of always including `stdc-predef.h`).
>>
>> What do you mean by "right value", though? As Aaron pointed out, the value seems only dependent upon what characters can fit into a wchar_t, which is independent of what unicode version the libc supports.
>
> I don't see how that follows from the definition in the C standard; it says "every character in the Unicode required set, when stored in an object of type `wchar_t`, has the same value as the short identifier of that character". This doesn't say anything about character or string literals, and for example `mbstowcs` stores characters in objects of type `wchar_t` too (it "stores not more than `n` wide characters into the array pointed to by `pwcs`"), so unless `mbstowcs` does the right thing I don't see how we can claim support for a new Unicode standard version. 
> As far as I can tell, this macro is documenting a property of the complete implementation (compiler plus standard library), and should be set to the minimum of the version supported by the compiler and the version supported by the stdlib. I think it's OK for the compiler to say it supports *any* version, though, because we don't expect future Unicode versions to require any changes on our part. But they may require standard library changes.

But that's exactly it -- there are no library OR compiler changes required to remain conformant with this property when a new standard version is released. The range of values wchar_t needs to represent won't change. Even considering mbstowcs, there's no problem because it will already do the right thing, with zero changes, no matter how many new characters are defined within that valid range of 0x0-0x10ffff -- assuming that it does store Unicode ordinal values into wchar_t in the first place. UTF-8/16/32 encoding and decoding are agnostic to which characters have been defined.
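
To make the "agnostic" point concrete, here is a minimal UTF-8 decoder sketch. It is not glibc's `mbstowcs` implementation; the function name and interface are hypothetical. The decoding is pure bit arithmetic plus range checks, with no table of assigned characters anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Nothing here consults which code points Unicode has assigned. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out) {
    uint32_t cp, min;
    size_t len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n < len) return 0;          /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, values past 0x10ffff, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    *out = cp;
    return len;
}
```

Feeding it the bytes `F1 90 80 80` yields 0x50000, a code point in a plane that (as far as I know) has no assigned characters today; the same code would keep producing the same value if a future Unicode version assigned something there.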

Of course, the library does need to make certain other changes corresponding to a new version, e.g. updating the tables for iswalpha to return true for newly defined alphabetical characters, but that functionality seems irrelevant to this define.

> If Aaron's checked with WG14 and the intent is for this to only constrain how literals are represented, and not the complete implementation, then I'm entirely fine with us defining the macro ourselves. But that's not the interpretation that several other vendors have taken. If we're confident that the intent is just that this macro lists (effectively) the latest version of the Unicode standard that we've heard of, we should let the various libc vendors that currently define the macro know that they're doing it wrong and the definition belongs in the compiler.

It's surely intended to cover the complete system, since the standard doesn't consider "compiler" vs "libc" as separate things; they're both just components of the "implementation". But as per the comments above, I don't think that changes the conclusion here.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577