[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Mon Jan 29 20:08:33 PST 2018


> On Jan 29, 2018, at 7:18 PM, Friedman, Eli via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> On 1/29/2018 5:21 PM, Sean Perry via cfe-dev wrote:
>> Hi, I have been investigating how to implement -finput-charset and -fexec-charset (including -fexec-wide-charset too). I noticed some discussions a couple of years ago and was planning on taking the same approach. In a nutshell, change the source manager to convert the input charset into UTF-8 and do the parsing using UTF-8 (e.g. in Lexer::InitLexer()). I would convert strings and character constants into the exec charset when a consumer asks for the string literal. This seems like a sound concept but there are many details that need to be ironed out. The clang data structures are filled with all kinds of strings (e.g. file names, identifiers, literals). What charset should be used when creating the clang ASTs? Should getName() return the name in UTF-8 or an output charset?
> 
> UTF-8; introducing another charset into the AST seems confusing for no benefit, given we're converting the input source code to UTF-8 anyway.  The only place we need to translate symbol names to the execution charset is IR generation.

+1
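
To make the first step concrete, here is a minimal sketch of transcoding the input charset to UTF-8 before lexing. It assumes ISO-8859-1 as the input charset purely for illustration; latin1ToUtf8 is a hypothetical helper, not an existing clang or LLVM function, and a real implementation would need a general converter that covers many charsets.

```cpp
#include <string>

// Hypothetical helper (not a clang API): transcode ISO-8859-1 bytes to UTF-8.
// Every Latin-1 byte equals the Unicode code point of the same value, so
// bytes below 0x80 pass through unchanged and bytes 0x80..0xFF become
// two-byte UTF-8 sequences.
std::string latin1ToUtf8(const std::string &input) {
  std::string out;
  out.reserve(input.size() * 2);
  for (char ch : input) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
      out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
    }
  }
  return out;
}
```

With something like this at the buffer-loading boundary, the lexer and the AST only ever see UTF-8.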

>> While looking into this I realized that we need one more charset. We have the input charset for the source code and the exec charset for the string literals. And we have an internal charset (UTF-8) for parsing. But we also need to have a charset for things like symbol names and file names.
>> 
> 
> The charset for symbol names has to be the same as -fexec-charset for any target that has an API like dlsym().
> 
> File names do use a different charset, but LLVM has a file system layer which abstracts that, so clang should pretend the filesystem is UTF-8.  (On Windows, we convert to UTF-16 before we call into the OS.  On other systems, we currently just assume everything is UTF-8, but we could change that if you need to run the compiler on a system where that doesn't hold.)

+2

>> We also need to consider messages. The messages may not be in the same charset as the input charset or internal. We will need to consider translation for messages and the substituted text. 
> 
> Messages should be UTF-8 until we have to convert them for output.  (An IDE always wants UTF-8.  For a console/stderr, we probably need some conversion, but IIRC that isn't implemented at the moment.)
> 
> Preprocessed output needs to be in the input charset; otherwise the compiler can't consume the result (for example, -save-temps would break).

+3 :-)
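
The reverse conversion at an output boundary (console messages, preprocessed output, or literals in the exec charset) could be sketched as follows. Again this assumes ISO-8859-1 as the target purely for illustration; utf8ToLatin1 is a hypothetical helper, not an existing clang or LLVM function. The interesting part is that this direction can fail, so the caller has to decide how to report text that the target charset cannot represent.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a clang API): transcode UTF-8 to ISO-8859-1.
// Returns false if the input is malformed or contains a code point above
// U+00FF, which Latin-1 cannot represent.
bool utf8ToLatin1(const std::string &utf8, std::string &out) {
  out.clear();
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(utf8[i]);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
      continue;
    }
    // Only the lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
    if ((c != 0xC2 && c != 0xC3) || i + 1 >= utf8.size())
      return false;
    unsigned char next = static_cast<unsigned char>(utf8[++i]);
    if ((next & 0xC0) != 0x80)
      return false; // not a continuation byte
    out.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
  }
  return true;
}
```

A caller would presumably turn a false result into a diagnostic rather than silently dropping or substituting characters.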

-Chris

