[cfe-dev] Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Wed Jun 15 00:10:34 PDT 2011

Thanks for the reply, Chris.

I was going to put off universal-character-names for now. It should be
easy to add afterward.

For the BOM and input character sets the general scheme I have at the moment is:

* Check for a BOM (warn if it contradicts the -finput-charset option)
* If -finput-charset is UTF-8, the locale-specified encoding is
UTF-8, or there is a UTF-8 BOM, just validate the input (otherwise
there is a performance hit later on if the input can contain invalid
UTF-8)
* If the user specified a non-UTF-8 -finput-charset, use iconv to
convert (ignoring the BOM, which might be a false positive)
* For any other BOM, use iconv to convert

The fallback is to check whether every byte is < 128, and to use iconv
or the Windows API to convert from the native encoding if any byte has
its high bit set. This appears to be a valid assumption on everything
except IBM machines with native EBCDIC, which I'm ignoring since Clang
won't compile anyway.

The main issue that I've run into is compatibility. My experimentation
with gcc shows a lot of edge cases, such as specifying a
-fwide-exec-charset that is some 8-bit encoding, or putting octal/hex
escapes in a string that violate the alignment.

-Scott

On Tue, Jun 14, 2011 at 11:01 PM, Chris Lattner <clattner at apple.com> wrote:
> On Jun 12, 2011, at 9:59 PM, Scott Conger wrote:
>> I'm new to clang. I've been looking at adding support for
>> -finput-charset, -fexec-charset and -fwide-exec-charset. I took a
>> look through the mailing list archives and code, and I haven't seen a
>> lot of discussion of this except in a general sense. Has anyone taken
>> a more serious look at this?
>
> Hi Scott,
>
> It would be great for you to tackle this.  Some people (including me) have thought about this a bit, but no specific work has started. Your assessment of how things work (everything ASCII) is right on target.
>
> I'd suggest starting with this approach:
>
> 1. Make the compiler fully UTF8 clean and happy.  This is moderately easy, the only major concern is that the lexer is highly performance sensitive.  Your plan makes sense to me.
> 2. Introduce support for UCNs.
> 3. Add support for specifying/detecting input charsets (e.g. through BOMs).
>
> Part #3 can be handled in several ways.  The best way to start is to have SourceManager detect that files need to be remapped when opened, and just rewrite the entire input buffer into UTF-8.  This way we only pay a performance hit when dealing with files in strange encodings.  For (common!) single-byte encodings that map 0-127 onto normal ASCII characters, SourceManager can scan the file, and if there are no high characters, no remapping is required.
>
> -Chris
>