[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)
Friedman, Eli via cfe-dev
cfe-dev at lists.llvm.org
Mon Jan 29 19:18:55 PST 2018
On 1/29/2018 5:21 PM, Sean Perry via cfe-dev wrote:
>
> Hi, I have been investigating how to implement -finput-charset
> and -fexec-charset (including -fexec-wide-charset too). I noticed some
> discussions a couple years ago and was planning on taking the same
> approach. In a nutshell, change the source manager to convert the
> input charset into UTF-8 and do the parsing using UTF-8 (e.g. in
> Lexer::InitLexer()). I would convert strings and character constants
> into the exec charset when a consumer asks for the string literal.
> This seems like a sound concept but there are many details that need
> to be ironed out. The clang data structure is filled with all kinds of
> strings (e.g. file names, identifiers, literals). What charset should
> be used when creating the clang ASTs? Should getName() return the
> name in UTF-8 or an output charset?
>
UTF-8; introducing another charset into the AST seems confusing for no
benefit, given we're converting the input source code to UTF-8 anyway.
The only place we need to translate symbol names to the execution
charset is IR generation.
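
To make that boundary concrete, here is a minimal sketch (not clang code) of
the two conversions involved: source bytes are decoded to UTF-8 once on the
way in, and text is encoded to the execution charset only on the way out.
ISO-8859-1 stands in for both charsets purely for illustration, and the names
latin1ToUtf8 and utf8ToLatin1 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode ISO-8859-1 source bytes into UTF-8, the
// internal charset. Every Latin-1 byte is the code point of the same value.
std::string latin1ToUtf8(const std::string &in) {
  std::string out;
  out.reserve(in.size() * 2);
  for (unsigned char c : in) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c)); // ASCII is unchanged
    } else {
      // U+0080..U+00FF always encode as exactly two UTF-8 bytes.
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// Hypothetical helper: encode internal UTF-8 text into ISO-8859-1, standing
// in for the execution charset. Returns false if the input is malformed or
// contains a character that Latin-1 cannot represent; `out` is then partial.
bool utf8ToLatin1(const std::string &in, std::string &out) {
  out.clear();
  for (std::size_t i = 0; i < in.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(in[i]);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
      continue;
    }
    // Only U+0080..U+00FF fit, i.e. lead byte 0xC2 or 0xC3 plus one
    // continuation byte.
    if ((c != 0xC2 && c != 0xC3) || i + 1 >= in.size())
      return false;
    unsigned char next = static_cast<unsigned char>(in[i + 1]);
    if ((next & 0xC0) != 0x80)
      return false;
    unsigned cp = ((c & 0x03u) << 6) | (next & 0x3Fu);
    assert(cp >= 0x80 && cp <= 0xFF);
    out.push_back(static_cast<char>(cp));
    ++i; // skip the continuation byte just consumed
  }
  return true;
}
```

Everything between those two calls (identifiers, literals, getName() results)
would stay UTF-8. A character with no representation in the target charset
makes utf8ToLatin1 return false, which a real implementation would presumably
turn into a diagnostic.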
>
> While looking into this I realized that we need one more charset. We
> have the input charset for the source code and exec charset for the
> string literals. And we have an internal charset (UTF-8) for
> parsing. But we also need to have a charset for things like symbol
> names and file names.
>
The charset for symbol names has to be the same as -fexec-charset for
any target that has an API like dlsym().
File names do use a different charset, but LLVM has a file system layer
which abstracts that, so clang should pretend the filesystem is UTF-8.
(On Windows, we convert to UTF-16 before we call into the OS. On other
systems, we currently just assume everything is UTF-8, but we could
change that if you need to run the compiler on a system where that
doesn't hold.)
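
For illustration only, the kind of conversion that happens at that OS boundary
can be sketched as a standalone UTF-8 to UTF-16 routine. This is not LLVM's
code; the function name utf8ToUtf16 is hypothetical, and the sketch simply
rejects malformed input instead of trying to recover from it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert a UTF-8 file name into UTF-16 code units.
// Returns false on malformed UTF-8 (bad lead or continuation byte, truncated
// sequence, overlong form, surrogate, or value above U+10FFFF).
bool utf8ToUtf16(const std::string &in, std::u16string &out) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
  out.clear();
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp;
    std::size_t len;
    if (lead < 0x80) {
      cp = lead;
      len = 1;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F;
      len = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F;
      len = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07;
      len = 4;
    } else {
      return false; // stray continuation byte or invalid lead byte
    }
    if (i + len > in.size())
      return false; // sequence runs past the end of the input
    for (std::size_t k = 1; k < len; ++k) {
      unsigned char cont = static_cast<unsigned char>(in[i + k]);
      if ((cont & 0xC0) != 0x80)
        return false;
      cp = (cp << 6) | (cont & 0x3F);
    }
    if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above the BMP become a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  return true;
}
```

The point of doing this only inside the file system layer is that the rest of
the compiler never sees anything but UTF-8 paths.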
>
> We also need to consider messages. The messages may not be in the same
> charset as the input charset or the internal one. We will need to consider
> translation for messages and the substituted text.
>
Messages should be UTF-8 until we have to convert them for output. (An
IDE always wants UTF-8. For a console/stderr, we probably need some
conversion, but IIRC that isn't implemented at the moment.)
Preprocessed output needs to be in the input charset; otherwise the
compiler can't consume the result (for example, -save-temps would break).
-Eli
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project