[cfe-dev] Supporting -finput-charset, -fexec-charset and -fwide-exec-charset?

Chris Lattner clattner at apple.com
Tue Jun 14 23:01:46 PDT 2011


On Jun 12, 2011, at 9:59 PM, Scott Conger wrote:
> I'm new to clang. I've been looking at adding support for
> -finput-charset, -fexec-charset and -fwide-exec-charset. I took a
> look through the mailing list archives and code, and I haven't seen a
> lot of discussion of this except in a general sense. Has anyone taken
> a more serious look at this?

Hi Scott,

It would be great for you to tackle this.  Some people (including me) have thought about this a bit, but no specific work has started. Your assessment of how things work (everything ASCII) is right on target.

I'd suggest starting with this approach:

1. Make the compiler fully UTF-8 clean and happy.  This is moderately easy; the only major concern is that the lexer is highly performance sensitive.  Your plan makes sense to me.
2. Introduce support for UCNs (universal character names).
3. Add support for specifying/detecting input charsets (e.g. through BOMs).

Part #3 can be handled in several ways.  The best way to start is to have SourceManager detect, when a file is opened, that it needs to be remapped, and then rewrite the entire input buffer into UTF-8.  This way we only pay a performance hit when dealing with files in strange encodings.  For (common!) single-byte encodings that map 0-127 onto the normal ASCII characters, SourceManager can scan the file, and if there are no high characters (bytes above 127), no remapping is required.
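
To make the fast path concrete, here is a rough sketch of the scan-then-remap idea, written as free functions rather than as real SourceManager code. The function names are hypothetical, and ISO-8859-1 (Latin-1) is only an assumed example of a single-byte input charset, picked because its byte values equal the Unicode code points; any other charset would need a lookup table instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration; not SourceManager's actual interface.

// True if every byte is in 0..127. Such a buffer is already valid UTF-8
// for any single-byte encoding that maps 0..127 onto ASCII.
bool isPureAscii(const std::string &buf) {
  for (unsigned char c : buf)
    if (c >= 0x80) return false;
  return true;
}

// Transcode ISO-8859-1 to UTF-8. Each byte >= 0x80 is the code point
// U+0080..U+00FF, which UTF-8 encodes as two bytes: 110xxxxx 10xxxxxx.
std::string latin1ToUtf8(const std::string &in) {
  std::string out;
  out.reserve(in.size() + in.size() / 2);
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// Only files that contain high bytes pay for a rewrite of the buffer.
std::string remapIfNeeded(const std::string &raw) {
  if (isPureAscii(raw)) return raw; // fast path: nothing to remap
  std::string out = latin1ToUtf8(raw);
  assert(out.size() > raw.size()); // every high byte grew to two bytes
  return out;
}
```

One consequence worth keeping in mind: once the buffer is rewritten, byte offsets in it no longer match offsets in the file on disk, so anything that reports raw file offsets would have to account for that.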

-Chris


