[LLVMdev] [cfe-dev] Reminder: 3.6 branch is coming

Dimitry Andric dimitry at andric.com
Mon Jan 12 00:07:18 PST 2015


On 10 Jan 2015, at 18:58, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:
> 
> On 10 Jan 2015, at 17:35, Dimitry Andric <dimitry at andric.com> wrote:
>> 
>> It looks like the interpretation of what "__STDC_MB_MIGHT_NEQ_WC__" means differs between the LLVM developers and the FreeBSD developers.  I'm not sure what a good solution is.
> 
> I've just read the relevant parts of the C11 spec, and it's not really clear to me what the 'basic character set' is.  There are two possible interpretations:
> 
> - The set of characters that can be represented by a char in locale "C"
> - The set of characters that can be represented by a char in *any* locale
> 
> On FreeBSD, it is correct to define __STDC_MB_MIGHT_NEQ_WC__ for the second definition, but not for the first.  Can anyone point to something in the spec that clarifies this?

I don't have any clarification, but I do want to quote some discussion from a previous email conversation with Ed Schouten and Richard Smith.  This started with me mailing Ed about this particular test failure, to which he replied:

> On 15 Oct 2014, at 14:12, Ed Schouten <ed at 80386.nl> wrote:
>> On Mon, Oct 13, 2014 at 10:57 PM, Dimitry Andric <dimitry at andric.com> wrote:
>> We talked about this a little on IRC, and the opinion seems to be that defining __STDC_MB_MIGHT_NEQ_WC__ for FreeBSD is not the right thing to do.  The standard says:
>> 
>> __STDC_MB_MIGHT_NEQ_WC__  The integer constant 1, intended to indicate that, in the encoding for wchar_t, a member of the basic character set need not have a code value equal to its value when used as the lone character in an ordinary character literal.
>> 
>> But the "basic character set" is just the whitespace characters, plus a-zA-Z0-9_{}[]#()<>%:;.?*+-/^&|~!=,\"', and I don't think this set is dependent on the locale at all, even with wchar_t...?
> 
> As far as I know, they are. On FreeBSD, the encoding of wchar_t
> depends on the locale entirely. This is annoying. I would have loved
> to see us use UCS-4 instead.
> 
> Even though FreeBSD does not ship with this, it should be perfectly
> feasible to come up with an EBCDIC locale that would directly map to
> the low 8 bits of wchar_t.
> 
> The test case in the LLVM tree is invalid and should be discarded. It
> erroneously assumes that the encoding of wchar_t is independent of the
> locale.

Richard then argued this makes no sense:

> On 15 Oct 2014, at 19:42, Richard Smith <richard at metafoo.co.uk> wrote:
>> On 15 Oct 2014 05:12, "Ed Schouten" <ed at 80386.nl> wrote:
> ...
>> The test case in the LLVM tree is invalid and should be discarded. It
>> erroneously assumes that the encoding of wchar_t is independent of the
>> locale.
>> 
> That makes no sense. These values are compile-time constants and cannot possibly depend on the locale.

Next, Ed remarked that wide characters are indeed locale-dependent:

> On 15 Oct 2014, at 19:50, Ed Schouten <ed at 80386.nl> wrote:
>> On Wed, Oct 15, 2014 at 7:42 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> ...
>> That makes no sense. These values are compile-time constants and cannot
>> possibly depend on the locale.
> 
> Exactly, but as far as I know, that's exactly the problem why wide
> characters are broken as implemented on FreeBSD. They are locale
> dependent, meaning that there is no way a compiler could reliably emit
> literal character/string literals.

And Richard then seemed to conclude that this was something to be solved on the FreeBSD side:

> On 15 Oct 2014, at 19:58, Richard Smith <richard at metafoo.co.uk> wrote:
>> On 15 Oct 2014 10:50, "Ed Schouten" <ed at 80386.nl> wrote:
> ...

> That is a much more fundamental problem than the value of this macro, and is a problem the FreeBSD folks will need to sort out for themselves.
> 
> Nonetheless, we need to have a fixed encoding for wide character literals, and the macro is specified as corresponding to *that* encoding. And in that encoding, narrow and wide basic source characters have the same value.

This was the end of the thread.  Now, to go back to the beginning again, when I read the C++11 standard, it mentions a "basic source character set" in 2.3 [lex.charset]:

> The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:[14]
> 
> abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ 0123456789 _{}[]#()<>%:;.?*+-/^&|~!=,\"'

It looks like this is equivalent to "basic character set", since references in the document mentioning that name refer back to section 2.3.  That section also has a footnote which seems relevant:

> The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

All in all, I'm not sure whether the test case should fail when __STDC_MB_MIGHT_NEQ_WC__ is 1 and all basic source characters just 'happen' to have the same value in their narrow and wide literal encodings.
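
To make the question concrete, here is a minimal sketch of the kind of comparison involved. It is not the actual test case from the LLVM tree; the names kNarrow, kWide, count_mismatches and mb_might_neq_wc_defined are hypothetical and exist only for this illustration. It only looks at the values the compiler assigns to literals, so it says nothing about what the runtime locale does with wchar_t.

```cpp
#include <cassert>
#include <cstddef>

// The 96 members of the basic source character set from C++11 2.3
// [lex.charset]: space, four control characters, 91 graphical characters.
// All names in this sketch are hypothetical.
static const char kNarrow[] =
    " \t\v\f\n"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

static const wchar_t kWide[] =
    L" \t\v\f\n"
    L"abcdefghijklmnopqrstuvwxyz"
    L"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    L"0123456789"
    L"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// Number of characters in the set, excluding the terminating null.
static const std::size_t kCount = sizeof(kNarrow) - 1;

// Counts the basic source characters whose wide literal value differs
// from their narrow literal value, as encoded by this compiler.
inline std::size_t count_mismatches() {
    // Both arrays must hold the same number of elements.
    assert(sizeof(kWide) / sizeof(kWide[0]) == sizeof(kNarrow));
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < kCount; ++i) {
        if (static_cast<wchar_t>(kNarrow[i]) != kWide[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}

// Reports whether the implementation defines the macro at all.
inline bool mb_might_neq_wc_defined() {
#ifdef __STDC_MB_MIGHT_NEQ_WC__
    return true;
#else
    return false;
#endif
}
```

The case I am unsure about is the one where mb_might_neq_wc_defined() returns true while count_mismatches() still returns 0: the macro only says the values "need not" be equal, so that combination does not look like a contradiction to me.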

Either that, or maybe just XFAIL the test case on FreeBSD, to fix it.

-Dimitry
