[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Mon Jan 29 19:18:55 PST 2018

On 1/29/2018 5:21 PM, Sean Perry via cfe-dev wrote:
>
> Hi, I have been investigating how to implement -finput-charset 
> and -fexec-charset (including -fexec-wide-charset too). I noticed some 
> discussions a couple years ago and was planning on taking the same 
> approach. In a nutshell, change the source manager to convert the 
> input charset into UTF-8 and do the parsing using UTF-8 (e.g., in 
> Lexer::InitLexer()). I would convert strings and character constants 
> into the exec charset when a consumer asks for the string literal. 
> This seems like a sound concept but there are many details that need 
> to be ironed out. The clang data structure is filled with all kinds of 
> strings (e.g., file names, identifiers, literals). What charset should 
> be used when creating the clang ASTs? Should getName() return the 
> name in UTF-8 or an output charset?
>

UTF-8; introducing another charset into the AST seems confusing for no 
benefit, given we're converting the input source code to UTF-8 anyway.  
The only place we need to translate symbol names to the execution 
charset is IR generation.
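
As a rough illustration of that late conversion, here is a minimal
sketch that keeps text in UTF-8 and converts it only at the point of
emission. It assumes ISO-8859-1 as a stand-in execution charset; the
names decodeUtf8 and utf8ToLatin1 are hypothetical helpers, not Clang
or LLVM APIs, and the decoder is deliberately simplified.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Decode one UTF-8 code point starting at s[i] and advance i past it.
// Returns std::nullopt on a bad lead byte, a truncated sequence, or a
// bad continuation byte. Minimal sketch: overlong forms are not rejected.
static std::optional<char32_t> decodeUtf8(std::string_view s, std::size_t &i) {
  const unsigned char b0 = static_cast<unsigned char>(s[i]);
  std::size_t extra = 0;
  char32_t cp = 0;
  if (b0 < 0x80) { cp = b0; }
  else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
  else return std::nullopt;
  if (s.size() - i <= extra) return std::nullopt;  // truncated sequence
  for (std::size_t k = 1; k <= extra; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return std::nullopt;
    cp = (cp << 6) | (b & 0x3F);
  }
  i += extra + 1;
  return cp;
}

// Convert internal UTF-8 text to the assumed execution charset
// (ISO-8859-1). Fails if any code point has no Latin-1 representation.
std::optional<std::string> utf8ToLatin1(std::string_view utf8) {
  std::string out;
  for (std::size_t i = 0; i < utf8.size();) {
    const std::optional<char32_t> cp = decodeUtf8(utf8, i);
    if (!cp || *cp > 0xFF) return std::nullopt;
    out.push_back(static_cast<char>(static_cast<unsigned char>(*cp)));
  }
  return out;
}
```

A caller would hold the UTF-8 form everywhere else and call
utf8ToLatin1 only when writing the final bytes; a failed conversion
is the point at which a diagnostic would be reported.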

>
> While looking into this I realized that we need one more charset. We 
> have the input charset for the source code and exec charset for the 
> string literals. And we have an internal charset (UTF-8) for 
> parsing. But we also need to have a charset for things like symbol 
> names and file names.
>

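The charset for symbol names has to be the same as -fexec-charset for 
any target that has an API like dlsym(), because the name passed to 
dlsym() is a runtime string in the execution charset and has to match 
the symbol table entry byte for byte.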

File names do use a different charset, but LLVM has a file system layer 
which abstracts that, so clang should pretend the filesystem is UTF-8.  
(On Windows, we convert to UTF-16 before we call into the OS.  On other 
systems, we currently just assume everything is UTF-8, but we could 
change that if you need to run the compiler on a system where that 
doesn't hold.)
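
To make the Windows case concrete, the following is a portable sketch
of converting a UTF-8 path to UTF-16 code units just before an OS
call. The function name utf8PathToUtf16 is hypothetical and this is
not LLVM's actual file system layer, only an illustration of the
transformation it has to perform.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Convert a UTF-8 path to UTF-16 code units. Returns std::nullopt on
// malformed input. Minimal sketch: overlong encodings are not rejected.
std::optional<std::u16string> utf8PathToUtf16(std::string_view utf8) {
  std::u16string out;
  for (std::size_t i = 0; i < utf8.size();) {
    const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
    // Sequence length and initial payload bits come from the lead byte.
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return std::nullopt;
    if (utf8.size() - i < len) return std::nullopt;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
      if ((b & 0xC0) != 0x80) return std::nullopt;
      cp = (cp << 6) | (b & 0x3F);
    }
    i += len;
    // Reject values outside Unicode and lone surrogate code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return std::nullopt;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above the BMP become a high/low surrogate pair.
      const char32_t v = cp - 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
  }
  return out;
}
```

Keeping this conversion at the OS boundary is what lets the rest of
the compiler treat every path as UTF-8.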
>
> We also need to consider messages. The messages may not be in the same 
> charset as the input charset or internal. We will need to consider 
> translation for messages and the substituted text.
>

Messages should be UTF-8 until we have to convert them for output. (An 
IDE always wants UTF-8.  For a console/stderr, we probably need some 
conversion, but IIRC that isn't implemented at the moment.)

Preprocessed output needs to be in the input charset; otherwise the 
compiler can't consume the result (for example, -save-temps would break).
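
A small sketch of why that matters, assuming ISO-8859-1 as the input
charset (latin1ToUtf8 is a hypothetical helper, not a Clang API): if
the preprocessor wrote its internal UTF-8 to disk and the compiler
then re-read that file with the same input charset setting, every
non-ASCII character would be decoded a second time and corrupted.

```cpp
#include <string>
#include <string_view>

// Treat each input byte as an ISO-8859-1 code point and re-encode it
// as UTF-8. This models the input-charset -> internal conversion.
std::string latin1ToUtf8(std::string_view in) {
  std::string out;
  for (char c : in) {
    const unsigned char b = static_cast<unsigned char>(c);
    if (b < 0x80) {
      out.push_back(c);  // ASCII is unchanged
    } else {
      // Code points 0x80..0xFF take two bytes in UTF-8.
      out.push_back(static_cast<char>(0xC0 | (b >> 6)));
      out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
    }
  }
  return out;
}
```

Here latin1ToUtf8("\xE9") yields the two UTF-8 bytes C3 A9 for U+00E9.
Feeding those bytes through latin1ToUtf8 again, as a second compile
of UTF-8 preprocessed output would, yields C3 83 C2 A9, which is no
longer the original character.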

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
