[cfe-dev] Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Fri Jun 17 09:43:47 PDT 2011

On Jun 15, 2011, at 12:10 AM, Scott Conger wrote:

> Thanks for the reply Chris.
> 
> I was going to put off universal-character-names for now. It should be
> easy to add afterward.

Makes sense.

> For the BOM and input character sets the general scheme I have at the moment is:
> 
> * Check for BOM (warning if it contradicts the inputcharset option)

Ok, I don't know GCC's policy on this (it's best to follow it for compatibility unless it is completely insane), but it seems reasonable that the -finput-charset option should only specify a charset for files without a BOM. If a file has a BOM, we should probably follow it.

> * If inputcharset option is UTF-8, the locale specified encoding is
> UTF-8 or there is a UTF-8 BOM, just validate the input (performance
> hit later on if there can be invalid UTF-8)

If I understand correctly, invalid UTF-8 can only occur in bytes with the high bit set (>= 0x80). The check can probably be inlined into the lexer at near-zero cost, which avoids a separate prepass.
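
To make the "near-zero cost" idea concrete, here is a hedged C++17 sketch of a validator with an ASCII fast path: bytes below 0x80 cost a single comparison, and only high bytes take the slower multi-byte path. The function name `findInvalidUtf8` is made up for this example, and this is a standalone routine rather than code inlined into Clang's lexer. The accepted byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences.

```cpp
#include <cstddef>
#include <string_view>

// Returns the offset of the first ill-formed sequence, or
// std::string_view::npos if the whole buffer is well-formed UTF-8.
inline std::size_t findInvalidUtf8(std::string_view s) {
  const auto *p = reinterpret_cast<const unsigned char *>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = p[i];
    if (c < 0x80) {  // fast path: ASCII is always valid
      ++i;
      continue;
    }
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the 2nd byte
    if (c >= 0xC2 && c <= 0xDF) {
      len = 2;
    } else if (c >= 0xE0 && c <= 0xEF) {
      len = 3;
      if (c == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
      if (c == 0xED) hi = 0x9F;  // reject UTF-16 surrogates
    } else if (c >= 0xF0 && c <= 0xF4) {
      len = 4;
      if (c == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
      if (c == 0xF4) hi = 0x8F;  // reject code points above U+10FFFF
    } else {
      return i;  // stray continuation byte, C0/C1, or F5..FF
    }
    if (n - i < len) return i;  // sequence truncated at end of buffer
    if (p[i + 1] < lo || p[i + 1] > hi) return i;
    for (std::size_t k = 2; k < len; ++k)
      if ((p[i + k] & 0xC0) != 0x80) return i;
    i += len;
  }
  return std::string_view::npos;
}
```

In a lexer, the same structure would presumably live in the existing per-character dispatch, so that ordinary ASCII source never reaches the multi-byte branch.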

> * If user specified a non-UTF-8 inputcharset, use iconv to convert
> (ignoring the BOM, which might be a false positive)
> * For other BOM, use iconv to convert

Yep.

> The fallback is to check if every byte is < 128, using iconv or the
> Windows API to convert from the native encoding if a high bit is set.
> This appears to be a valid assumption on everything except IBM
> machines with native EBCDIC, which I'm ignoring since Clang won't
> compile anyway.

Yes, we don't care about EBCDIC. If someone comes around with a deep passion for it later, we can deal with it then.

> The main issue that I've run into is compatibility. My experimentation
> with gcc shows a lot of edge cases such as specifying a
> wide-exec-charset that is some 8 bit encoding, or putting octal/hex in
> a string that violates the alignment.

I'm not sure what you mean here.

-Chris