[cfe-dev] Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?
Chris Lattner
clattner at apple.com
Fri Jun 17 09:43:47 PDT 2011
On Jun 15, 2011, at 12:10 AM, Scott Conger wrote:
> Thanks for the reply Chris.
>
> I was going to put off universal-character-names for now. It should be
> easy to add afterward.
Makes sense.
> For the BOM and input character sets the general scheme I have at the moment is:
>
> * Check for BOM (warning if it contradicts the inputcharset option)
OK. I don't know GCC's policy on this (it's best to follow it for compatibility unless it is completely insane), but it seems reasonable that -finput-charset should only specify the charset for files without a BOM. If a file has a BOM, we should probably follow it.
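To make the "BOM wins over the option" idea concrete, here is a rough sketch. The names (BomKind, detectBom, chooseCharset) are made up for illustration and are not Clang API, and this is not a claim about what GCC does.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical types and names for illustration only; not part of Clang.
enum class BomKind { None, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

struct BomResult {
  BomKind kind;
  std::size_t length; // number of BOM bytes to skip before lexing
};

inline BomResult detectBom(const std::string &buf) {
  struct Entry {
    const char *bytes;
    std::size_t len;
    BomKind kind;
  };
  // The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
  // so the 4-byte patterns are tried first. A UTF-16LE file that starts
  // with U+0000 would be misdetected: BOM sniffing can give false positives.
  static const Entry table[] = {
      {"\xFF\xFE\x00\x00", 4, BomKind::UTF32LE},
      {"\x00\x00\xFE\xFF", 4, BomKind::UTF32BE},
      {"\xEF\xBB\xBF", 3, BomKind::UTF8},
      {"\xFF\xFE", 2, BomKind::UTF16LE},
      {"\xFE\xFF", 2, BomKind::UTF16BE},
  };
  for (const Entry &e : table)
    if (buf.size() >= e.len && std::memcmp(buf.data(), e.bytes, e.len) == 0)
      return {e.kind, e.len};
  return {BomKind::None, 0};
}

// Assumed policy: a BOM, if present, decides the charset; the value of the
// command-line option is used only for files without a BOM.
inline std::string chooseCharset(const std::string &buf,
                                 const std::string &optionCharset) {
  switch (detectBom(buf).kind) {
  case BomKind::UTF8:    return "UTF-8";
  case BomKind::UTF16LE: return "UTF-16LE";
  case BomKind::UTF16BE: return "UTF-16BE";
  case BomKind::UTF32LE: return "UTF-32LE";
  case BomKind::UTF32BE: return "UTF-32BE";
  case BomKind::None:    break;
  }
  return optionCharset;
}
```

A caller would skip `detectBom(buf).length` bytes and then either validate (UTF-8) or convert (anything else) the rest of the buffer.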
> * If inputcharset option is UTF-8, the locale specified encoding is
> UTF-8 or there is a UTF-8 BOM, just validate the input (performance
> hit later on if there can be invalid UTF-8)
If I understand correctly, invalid UTF-8 can only occur in bytes with the high bit set (0x80 and above); plain ASCII bytes are always valid. The check can probably be inlined into the lexer at near-zero cost, which avoids a prepass.
> * If user specified a non-UTF-8 inputcharset, use iconv to convert
> (ignoring the BOM, which might be a false positive)
> * For other BOM, use iconv to convert
Yep.
> The fallback is to check if every byte is < 128, using iconv or the
> windows API to convert from the native encoding if a high bit is set.
> This appears to be a valid assumption on everything except IBM
> machines with native EBCDIC, which I'm ignoring since Clang won't
> compile anyways.
Yes, we don't care about EBCDIC. If someone comes around with a deep passion for it later, we can deal with it then.
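For reference, the fallback check described above is about this simple. This is a sketch with hypothetical names (isPureAscii, Fallback, chooseFallback); the actual conversion step via iconv or the Windows API is left out.

```cpp
#include <string>

// Hypothetical helper, not Clang API: true if no byte has the high bit set.
inline bool isPureAscii(const std::string &buf) {
  for (unsigned char c : buf)
    if (c >= 0x80)
      return false;
  return true;
}

// What to do with a file that has no BOM and no explicit charset.
enum class Fallback { UseAsIs, ConvertFromNative };

// Assumes an ASCII-compatible native encoding (i.e. not EBCDIC): a buffer
// with only 7-bit bytes is already valid UTF-8 and needs no conversion.
inline Fallback chooseFallback(const std::string &buf) {
  return isPureAscii(buf) ? Fallback::UseAsIs : Fallback::ConvertFromNative;
}
```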
> The main issue that I've run into is compatibility. My experimentation
> with gcc shows a lot of edge cases such as specifying a
> wide-exec-charset that is some 8 bit encoding, or putting octal/hex in
> a string that violates the alignment.
I'm not sure what you mean here.
-Chris