[PATCH] D106577: [clang] Define __STDC_ISO_10646__

Fri Jul 23 04:40:26 PDT 2021

aaron.ballman requested changes to this revision.
aaron.ballman added a comment.
This revision now requires changes to proceed.

In D106577#2899574 <https://reviews.llvm.org/D106577#2899574>, @cor3ntin wrote:

> In D106577#2898967 <https://reviews.llvm.org/D106577#2898967>, @hubert.reinterpretcast wrote:
>
>> Every character in the Unicode required set encoded in what way? To say that such a character is stored in an object of type `wchar_t` means that interpreting the `wchar_t` yields that stored character. Methods to determine the interpretation of the stored `wchar_t` value include locale-sensitive functions such as `wcstombs` (and thus is tied to libc).
>
> "has the same value as the short identifier of that character." implies UTF-32.
> There is no mention of interpretation here, the *value* is the same. As in, when casting to an integer type you get the code point value.

This is how I interpret the words from the standard as well. I think it's purely about the bit width of `wchar_t` and whether it's wide enough to hold all Unicode code points as of a particular Unicode standard release.
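
As a rough illustration of that reading (a sketch only, not part of the patch; the function names are hypothetical), the "value" interpretation amounts to the following: when the macro is defined, `wchar_t` must be wide enough for U+10FFFF, and converting a wide character to an integer gives its code point.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: 1 when __STDC_ISO_10646__ is visible here
   (after including a libc header), 0 otherwise. */
int iso10646_is_defined(void) {
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 when wchar_t can represent every value
   up to U+10FFFF, the largest Unicode code point. */
int wchar_fits_all_code_points(void) {
    return (unsigned long long)WCHAR_MAX >= 0x10FFFFULL;
}

/* The "value" reading: converting to an integer yields the code point. */
unsigned long wide_value(wchar_t c) {
    return (unsigned long)c;
}

/* Only checks anything when the macro is defined; otherwise no
   assumption about the wide character representation is made. */
void check_value_reading(void) {
#ifdef __STDC_ISO_10646__
    assert(wchar_fits_all_code_points());
    assert(wide_value(L'\u20AC') == 0x20ACUL); /* EURO SIGN */
#endif
}
```

Nothing here consults the locale or any libc conversion function, which is why this reading places the macro mostly in the compiler's domain.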

I tried to do some archeology to see how this predefined macro came into existence. It was added in C99, at a time before we seemed to be collecting editor's reports, and there are no obvious papers on the topic, so I don't know what proposal added the feature. The C99 rationale document does not mention the macro at all, but from my reading of the rationale, it seems possible that this macro is related to the introduction of UCNs and whether `\Unnnnnnnn` can be stored in a `wchar_t`.

One thing I did find when doing my research though was: https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html which says, in part,

  The standard defines at least a macro __STDC_ISO_10646__ that is only defined on systems where the wchar_t type encodes ISO 10646 characters. If this symbol is not defined one should avoid making assumptions about the wide character representation.

This matches the interpretation that the libc encoding is salient, but we still need to define what happens in freestanding environments where there is no libc with character conversion functions, so it also sounds like it's partially in the realm of the compiler.

> *Storing* that value might involve either assigning from a wide-character literal or `mbrtowc`.
> Both methods imply some transcoding, the latter of which could be affected by locale such that it would store a different character, but again, is it related to this wording?
>
> Note that by virtue of being a macro this cannot possibly be affected by locale.
>
> A few scenarios
>
> - The encoding of wide literal as determined by clang is not utf-32, the macro should be defined by neither the compiler nor the library
> - The encoding of wide literals as determined by the compiler is utf-32, libc agrees... this works as intended
> - The encoding of wide literals as determined by the compiler is utf-32, libc disagrees... nothing good can come of that.
>
> The compiler and the libc have to agree here.
> The library cannot (should not) define this macro without knowing the wide literal encoding.

I agree that the compiler and libc need to agree on the encoding.
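
To make the agreement concrete, here is a minimal sketch of a runtime check (hypothetical function names, not from the patch). It assumes a hosted environment where a UTF-8 locale named `C.UTF-8` or `en_US.UTF-8` may be available, decodes the UTF-8 bytes of U+20AC with `mbrtowc`, and compares the result with the compiler's wide literal.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper. Returns 1 if libc decodes UTF-8 "EURO SIGN" to
   the same value the compiler chose for L'\u20AC', 0 if the two differ,
   and -1 if no UTF-8 locale could be selected or decoding failed.
   Note: this changes LC_CTYPE and does not restore it. */
int libc_agrees_with_wide_literal(void) {
    static const char euro_utf8[] = "\xE2\x82\xAC";
    /* Assumed locale names; availability varies by system. */
    static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8" };
    int found = 0;
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            found = 1;
            break;
        }
    }
    if (!found)
        return -1;

    mbstate_t state;
    memset(&state, 0, sizeof state);
    wchar_t wc = 0;
    size_t n = mbrtowc(&wc, euro_utf8, sizeof euro_utf8 - 1, &state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;

    return wc == L'\u20AC';
}

/* A program that depends on the agreement could call this at startup. */
void require_agreement(void) {
    assert(libc_agrees_with_wide_literal() != 0);
}
```

A return value of 0 corresponds to the third scenario in the quoted list, where the compiler and libc disagree.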

> Note that both standards imply that these macros should be defined when relevant independently of the environment which includes hosted and non-Linux+glibc platforms. So relying on a specific glibc implementation
> seems dubious. Especially as glibc will *always* define that macro

I think the point was more about "who is generally responsible for defining this macro, the compiler or the library" as opposed to it being a glibc thing specifically. I notice that musl also defines the macro (https://git.musl-libc.org/cgit/musl/tree/include/stdc-predef.h#n4).

> Now, I agree that the compiler and the library should ideally expose the same *value* for this macro (although I struggle to find code that actually relies on the value)
>
> When D34158 <https://reviews.llvm.org/D34158> as mentioned by @jyknight lands, the value will be set to that of the library version thereby overriding the compiler default.
> On other systems, the value will be set to the library version whenever the library is included.

I think that's the correct behavior. The compiler says "my wchar_t encodes ISO 10646" and the library has the chance to say "my wide char functions expect something else" if need be.
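
For reference, a small sketch of how code might read whichever value ends up visible (hypothetical helper names; the standard specifies the form `yyyymmL`). Whether the compiler's default or the library's value is seen depends on which headers were included before this point.

```c
#include <assert.h>
#include <wchar.h> /* on some libcs, including a header makes the library's value visible */

/* Hypothetical helper: the yyyymm value of __STDC_ISO_10646__,
   or 0 when the macro is not defined. */
long iso10646_version(void) {
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Split the yyyymm value into its year and month parts. */
int iso10646_year(void)  { return (int)(iso10646_version() / 100); }
int iso10646_month(void) { return (int)(iso10646_version() % 100); }

/* Sanity-check the yyyymm shape when the macro is defined. */
void check_version_shape(void) {
    if (iso10646_version() != 0) {
        assert(iso10646_month() >= 1 && iso10646_month() <= 12);
    }
}
```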

Given that there are two people who think this macro relates to the standard library, I'm going to mark the review as needing changes so we don't accidentally land it. I think we should ask for an interpretation on the WG14 reflectors and come back once we have more information.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106577/new/

https://reviews.llvm.org/D106577