[cfe-dev] Almost there...

Sun Jun 7 01:40:17 PDT 2009

On Sun, 7 Jun 2009 17:18:41 +0900, Neil Booth <neil at daikokuya.co.uk> wrote:
> Eli Friedman wrote:-
> 
>> >> Reason I am thinking about this now is what does a d-char mean for a
>> >> char32_t string?  Assuming we can read a UTF32 formatted source file,
>> >> those
>> >
>> > You should be able to assume the basic character set is single byte;
>> > both C and C++ require this.  So no UTF32 source files.
>> 
>> I don't see any connection between the basic character set and the
>> encoding of the source file.
> 
> The source character set is generally understood to be the character
> set the user interacts with their terminal, editor etc.
> 
> http://www.dinkumware.com/manuals/?manual=compleat&page=charset.html
> 
> Each member of the basic character set is required to be represented
> as a single byte in the source character set.

The source character set is irrelevant, at least going by the C++ standard.
The very first phase of translation (C++ 2.2p1, bullet point 1) specifies
an implementation-defined mapping of physical source file characters to the
basic source character set. Making that mapping a UTF-32 to UTF-8
transcoding is perfectly valid.

Interestingly enough, the standard says that any character not in the basic
set must be encoded as a UCN (universal-character-name). That sounds
impractical, so I guess since we want to use UTF-8 internally anyway, we
should make use of the as-if rule and instead represent everything,
including UCNs in the original source, in its real UTF-8 encoding.
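
To make the idea concrete, here is a rough sketch of the two pieces involved:
transcoding a UTF-32 source buffer to UTF-8, and folding UCNs that appear in
the source text into the same UTF-8 form. None of this is existing Clang
code. The names append_utf8, utf32_to_utf8, hex_value and expand_ucns are
made up for illustration, and the UCN scanner is deliberately naive: it
assumes it only has to skip other backslash escapes and does not know about
raw strings, trigraphs or line splicing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for code points that UTF-8 can encode (no surrogates, max U+10FFFF).
bool is_scalar(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Append the UTF-8 encoding of one Unicode scalar value to `out`.
void append_utf8(std::string& out, char32_t cp) {
    assert(is_scalar(cp));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Phase-1 style mapping: UTF-32 physical source to a UTF-8 internal buffer.
std::string utf32_to_utf8(const std::u32string& src) {
    std::string out;
    for (char32_t cp : src) append_utf8(out, cp);
    return out;
}

// Value of a hexadecimal digit, or -1 if `c` is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replace well-formed \uXXXX and \UXXXXXXXX sequences with their UTF-8
// encoding. Any other backslash pair (including an escaped backslash) is
// copied through unchanged, as is a malformed or out-of-range UCN.
std::string expand_ucns(const std::string& src) {
    std::string out;
    const std::size_t n = src.size();
    std::size_t i = 0;
    while (i < n) {
        if (src[i] != '\\' || i + 1 >= n) {
            out += src[i++];
            continue;
        }
        const char kind = src[i + 1];
        const std::size_t digits = kind == 'u' ? 4 : kind == 'U' ? 8 : 0;
        char32_t cp = 0;
        bool ok = digits != 0 && i + 2 + digits <= n;
        for (std::size_t k = 0; ok && k < digits; ++k) {
            const int v = hex_value(src[i + 2 + k]);
            if (v < 0) ok = false;
            else cp = cp * 16 + static_cast<char32_t>(v);
        }
        if (ok && is_scalar(cp)) {
            append_utf8(out, cp);
            i += 2 + digits;
        } else {
            out += src[i];
            out += src[i + 1];
            i += 2;
        }
    }
    return out;
}
```

With something along these lines, an extended character typed directly in a
UTF-32 file and the same character spelled as a UCN in an ASCII file end up
as identical bytes in the internal buffer, which is what the as-if rule
would let us get away with.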

Sebastian