<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 1/29/2018 5:21 PM, Sean Perry via
cfe-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">
<p><font size="2">Hi, I have been investigating how to
        implement -finput-charset and -fexec-charset (including
        -fwide-exec-charset too). I noticed some discussions a couple
years ago and was planning on taking the same approach. In a
nutshell, change the source manager to convert the input
        charset into UTF-8 and do the parsing using UTF-8 (e.g., in
Lexer::InitLexer()). I would convert strings and character
constants into the exec charset when a consumer asks for the
string literal. This seems like a sound concept but there are
many details that need to be ironed out. The clang data
        structure is filled with all kinds of strings (e.g., file names,
        identifiers, literals). What charset should be used when
        creating the clang ASTs? Should getName() return the name in
UTF-8 or an output charset? </font><br>
</p>
</blockquote>
<br>
UTF-8; introducing another charset into the AST seems confusing for
no benefit, given we're converting the input source code to UTF-8
anyway. The only place we need to translate symbol names to the
execution charset is IR generation.<br>
<br>
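As a rough sketch of the flow discussed above (not clang code): Python
codecs stand in for the compiler's converters, and the helper names and
the two charset values are assumptions picked only for illustration.
Source bytes are converted to UTF-8 once on input, everything internal
stays UTF-8, and a literal is converted to the execution charset only
when something asks for it.<br>
<pre>
```python
# Hypothetical helpers; Python codecs stand in for the compiler's converters.
INPUT_CHARSET = "latin-1"  # assumed value of -finput-charset
EXEC_CHARSET = "cp037"     # assumed value of -fexec-charset (an EBCDIC code page)


def load_source(raw_bytes, input_charset=INPUT_CHARSET):
    """Convert source bytes from the input charset to UTF-8 for lexing."""
    return raw_bytes.decode(input_charset).encode("utf-8")


def literal_for_target(utf8_literal, exec_charset=EXEC_CHARSET):
    """Convert a UTF-8 string literal to the execution charset on demand."""
    return utf8_literal.decode("utf-8").encode(exec_charset)


# Bytes as they would sit on disk in the input charset.
source = "caf\u00e9".encode(INPUT_CHARSET)

# Internal representation: always UTF-8, regardless of the input charset.
internal = load_source(source)

# Only the emitted literal is in the execution charset.
emitted = literal_for_target(internal)
```
</pre>
Here <tt>internal</tt> is what the lexer and AST would see, and
<tt>emitted</tt> is what would end up in the object file.<br>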
<blockquote type="cite"
cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">
<p><br>
<font size="2">While looking into this I realized that we need
one more charset. We have the input charset for the source
        code and exec charset for the string literals. And we have
        an internal charset (UTF-8) for parsing. But we also need to
have a charset for things like symbol names and file names.</font></p>
</blockquote>
<br>
The charset for symbol names has to be the same as -fexec-charset
for any target that has an API like dlsym().<br>
<br>
File names do use a different charset, but LLVM has a file system
layer which abstracts that, so clang should pretend the filesystem
is UTF-8. (On Windows, we convert to UTF-16 before we call into the
OS. On other systems, we currently just assume everything is UTF-8,
but we could change that if you need to run the compiler on a system
where that doesn't hold.)<br>
<blockquote type="cite"
cite="mid:OF852B0509.2E8DCF9B-ON00258225.0001AC30-85258225.000779F2@notes.na.collabserv.com">
<p><font size="2">We also need to consider messages. The messages
may not be in the same charset as the input charset or
internal. We will need to consider translation for messages
and the substituted text. </font><br>
</p>
</blockquote>
<br>
Messages should be UTF-8 until we have to convert them for output.
(An IDE always wants UTF-8. For a console/stderr, we probably need
some conversion, but IIRC that isn't implemented at the moment.)<br>
<br>
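To illustrate the kind of conversion I mean for messages, here is a
minimal sketch. It is an assumption about how this could look, not
something clang does today; <tt>render_diagnostic</tt> is a hypothetical
name, and replacing unmappable characters is just one possible policy.<br>
<pre>
```python
def render_diagnostic(utf8_message, console_charset):
    """Convert a UTF-8 diagnostic for a console.

    Characters the console charset cannot represent are replaced
    with "?" instead of raising an error.
    """
    text = utf8_message.decode("utf-8")
    return text.encode(console_charset, errors="replace")


# A diagnostic that mentions a non-ASCII identifier (Greek small letter pi).
message = "error: unknown identifier '\u03c0'".encode("utf-8")

# An IDE takes the UTF-8 bytes unchanged.
for_ide = message

# A console with a narrower charset needs a conversion step.
for_console = render_diagnostic(message, "ascii")
```
</pre>
With an ASCII console the identifier degrades to <tt>?</tt>, while the
IDE path still carries the original UTF-8 text.<br>
<br>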
    Preprocessed output needs to be in the input charset; otherwise the
compiler can't consume the result (for example, -save-temps would
break).<br>
<br>
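A small sketch of why the round trip matters, again with Python codecs
standing in for the compiler's converters; the function names and the
choice of cp1252 as the input charset are assumptions made only for this
example.<br>
<pre>
```python
INPUT_CHARSET = "cp1252"  # assumed value of -finput-charset


def to_internal(raw_bytes):
    """Read a file in the input charset and return UTF-8 for parsing."""
    return raw_bytes.decode(INPUT_CHARSET).encode("utf-8")


def write_preprocessed(utf8_text):
    """Write preprocessed output back out in the input charset."""
    return utf8_text.decode("utf-8").encode(INPUT_CHARSET)


# Source file on disk, encoded in the input charset.
original = "const char *s = \"na\u00efve\";\n".encode(INPUT_CHARSET)

# First compile: convert to UTF-8, then save preprocessed output.
first_pass = to_internal(original)
saved = write_preprocessed(first_pass)

# Second compile reads the saved file with the same -finput-charset.
second_pass = to_internal(saved)

# What happens if the preprocessed output were written as raw UTF-8
# and then read back as if it were in the input charset.
misread = to_internal(first_pass)
```
</pre>
<tt>second_pass</tt> matches <tt>first_pass</tt>, so the saved file is
consumable; <tt>misread</tt> does not, because the UTF-8 bytes get
decoded a second time as cp1252.<br>
<br>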
-Eli<br>
<pre class="moz-signature" cols="72">--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>
</body>
</html>