[cfe-dev] Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?
Scott Conger
scott.conger at gmail.com
Fri Jun 17 12:51:45 PDT 2011
For the BOM detection, I meant that we should always go with the
user-specified -finput-charset option, and simply issue a warning if
the BOM contradicts it. gcc always goes with what the user specified
in -finput-charset.
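
Concretely, the policy I have in mind looks something like this (a
rough sketch; the names are illustrative, not actual Clang APIs):

#include <stdio.h>
#include <string.h>

/* The BOM is only consulted to diagnose a mismatch;
 * -finput-charset always wins, matching gcc. */
static const char *bom_charset(const unsigned char *buf, size_t len) {
    if (len >= 4 && !memcmp(buf, "\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (len >= 4 && !memcmp(buf, "\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (len >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3))     return "UTF-8";
    if (len >= 2 && !memcmp(buf, "\xFF\xFE", 2))         return "UTF-16LE";
    if (len >= 2 && !memcmp(buf, "\xFE\xFF", 2))         return "UTF-16BE";
    return NULL;  /* no BOM */
}

static void warn_on_bom_mismatch(const unsigned char *buf, size_t len,
                                 const char *input_charset) {
    const char *bom = bom_charset(buf, len);
    if (bom && strcmp(bom, input_charset) != 0)
        fprintf(stderr, "warning: file has a %s BOM but "
                "-finput-charset=%s; using %s\n",
                bom, input_charset, input_charset);
}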
As for edge cases, consider something like:
char c = '(some Han character that takes 4 bytes)';
char d = '0xFFC';
Here, gcc truncates the constant. You get:
test.c:6:12: warning: character constant too long for its type
test.c:6:13: warning: character constant too long for its type
This seems to be tied to the -Wcharacter-truncation option.
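
For reference, here is a plain-ASCII version that reproduces the
truncation (a sketch; the value of a multi-character constant is
implementation-defined, but gcc keeps the low bytes):

#include <stdio.h>

int main(void) {
    char c = 'abcd';  /* 4-byte multi-character constant; gcc keeps 'd' */
    char d = '0xFFC'; /* 5 bytes: "character constant too long for its type" */
    printf("c=%c d=%c\n", c, d);
    return 0;
}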
Another example is:
wchar_t* str = L"abcdef\x54ghijklmnop";
If sizeof(wchar_t) == 4, we normally expect each character to be
aligned to a 4-byte boundary. The string suggests that a single byte
is inserted in the middle, which leaves the compiler with a question
of interpretation. It seems that gcc zero-pads the value to preserve
the alignment.
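
One way to see this (a quick sketch, assuming a 4-byte wchar_t and
the default wide execution charset):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    wchar_t str[] = L"abcdef\x54ghijklmnop";
    /* \x54 is 'T'; it still occupies one full zero-extended wchar_t
       element, so the array stays aligned. */
    for (size_t i = 0; i < sizeof str / sizeof str[0]; i++)
        printf("%2zu: 0x%08lx\n", i, (unsigned long)str[i]);
    return 0;
}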
A more complex case is when you take that same string and tell gcc
-fwide-exec-charset=ASCII. This doesn't make a lot of sense, but gcc
happily takes it.
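
To see what gcc actually emits here, one can dump the literal's
bytes (a sketch of the probing steps, nothing more):

/* wide_ascii.c -- compile and inspect:
 *   gcc -fwide-exec-charset=ASCII -c wide_ascii.c
 *   objdump -s -j .rodata wide_ascii.o
 * The section dump shows the byte pattern gcc chose for the
 * converted literal. */
#include <wchar.h>
const wchar_t *str = L"abcdef\x54ghijklmnop";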
In these cases, we simply have to be sure we correctly replicate
gcc's behavior and make sure we have adequate test coverage.
-Scott
On Fri, Jun 17, 2011 at 9:43 AM, Chris Lattner <clattner at apple.com> wrote:
>
> On Jun 15, 2011, at 12:10 AM, Scott Conger wrote:
>
>> Thanks for the reply Chris.
>>
>> I was going to put off universal-character-names for now. It should be
>> easy to add afterward.
>
> Makes sense.
>
>> For the BOM and input character sets the general scheme I have at the moment is:
>>
>> * Check for BOM (warning if it contradicts the inputcharset option)
>
> Ok, I don't know GCC's policy on this (it's best to follow it for compatibility unless it is completely insane) but it seems reasonable that the -finput-charset option should only specify a charset for files without a BOM. If a file has a BOM, we should probably follow it.
>
>> * If the inputcharset option is UTF-8, the locale-specified
>> encoding is UTF-8, or there is a UTF-8 BOM, just validate the input
>> (there is a performance hit later on if invalid UTF-8 can get through)
>
> If I understand correctly, the only invalid UTF-8 occurs with high characters. This can probably be inlined into the lexer at near-zero cost to avoid a prepass.
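
Agreed -- roughly the shape I had in mind (a sketch, not actual Clang
code; a production validator needs a few more overlong/surrogate
range checks):

#include <stddef.h>

static int valid_utf8(const unsigned char *p, size_t n) {
    const unsigned char *end = p + n;
    while (p < end) {
        if (*p < 0x80) { p++; continue; }  /* ASCII stays on the fast path */
        int len;
        if ((*p & 0xE0) == 0xC0 && *p >= 0xC2)      len = 2;
        else if ((*p & 0xF0) == 0xE0)               len = 3;
        else if ((*p & 0xF8) == 0xF0 && *p <= 0xF4) len = 4;
        else return 0;                     /* bad lead byte */
        if (end - p < len) return 0;       /* truncated sequence */
        for (int i = 1; i < len; i++)
            if ((p[i] & 0xC0) != 0x80) return 0;  /* bad continuation */
        p += len;
    }
    return 1;
}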
>
>> * If user specified a non-UTF-8 inputcharset, use iconv to convert
>> (ignoring the BOM, which might be a false positive)
>> * For other BOM, use iconv to convert
>
> Yep.
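
For reference, I expect the conversion step to be the usual iconv
sequence -- a sketch with abbreviated error handling (a real version
has to grow the buffer on E2BIG and diagnose EILSEQ/EINVAL):

#include <iconv.h>
#include <stdlib.h>

/* Convert `in` (inlen bytes, encoding `from`) to UTF-8; returns a
   malloc'd buffer or NULL. Sketch only: errors just bail out. */
static char *to_utf8(const char *from, char *in, size_t inlen, size_t *outlen) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;
    size_t cap = inlen * 4 + 4;  /* crude upper bound for the output */
    char *out = malloc(cap), *outp = out;
    size_t outleft = cap;
    if (!out || iconv(cd, &in, &inlen, &outp, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);
    *outlen = cap - outleft;
    return out;
}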
>
>> The fallback is to check whether every byte is < 128, using iconv or
>> the Windows API to convert from the native encoding if a high bit is
>> set. This appears to be a valid assumption on everything except IBM
>> machines with native EBCDIC, which I'm ignoring since Clang won't
>> build on them anyway.
>
> Yes, we don't care about EBCDIC. If someone comes around with a deep passion for it later, we can deal with it then.
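
For what it's worth, the fallback check itself is trivial -- a sketch:

#include <stddef.h>

/* If every byte is < 128 the buffer is plain ASCII and needs no
   conversion; otherwise hand it to iconv / the Windows APIs. */
static int is_plain_ascii(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (buf[i] & 0x80)
            return 0;
    return 1;
}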
>
>> The main issue that I've run into is compatibility. My
>> experimentation with gcc shows a lot of edge cases, such as
>> specifying a wide-exec-charset that is some 8-bit encoding, or
>> putting octal/hex escapes in a string in a way that violates the
>> alignment.
>
> I'm not sure what you mean here,
>
> -Chris
>