<html><body bgcolor="#FFFFFF"><p><font size="2">clang and llvm aren't performing any conversion right now. Everything assumes the input, output and exec charsets are UTF-8. One user scenario I am trying to enable is the input charset being EBCDIC on a system where EBCDIC is the native charset. Doing this is non-trivial and exposes the issues I outline below, and most likely more (e.g. debug info).</font><br><br><font size="2">The clang source is filled with code that is equivalent to "ch=='A'" or "ch&gt;='A' &amp;&amp; ch&lt;='Z'", where ch is a character from a source file. These examples make many assumptions. Some of these are:</font><br><font size="2">- If the code is compiled on an EBCDIC system, the character literals 'A' and 'Z' will be in EBCDIC.</font><br><font size="2">- The source files are also in EBCDIC.</font><br><font size="2">- The code points from 'A' to 'Z' are contiguous. That is not the case in EBCDIC.</font><br><font size="2">We can solve many of these problems by converting the source files from EBCDIC into UTF-8 and changing the comparisons above to "ch== u'A'" and "ch&gt;= u'A' and ch &lt;= u'Z'". Hence the internal charset is now really UTF-8.</font><br><br><font size="2">With no other changes, the getName() family of functions in the clang AST will return identifier names in UTF-8. This doesn't look very pretty or helpful in error messages or the internal dump functions. When the input charset is not from the same family as UTF-8 (i.e. ASCII based), you need to perform a lot of conversions to get useful output from the compiler. I'm proposing to have the getName() functions return the declaration name in the output charset so we get intelligible output.</font><br><br><font size="2">The same issue happens with include file names. The llvm support library provides a layer of encapsulation around files but it does not translate the file names from UTF-8 to the system charset (e.g. EBCDIC - IBM-1047). It just uses the file names as they are. 
In addition, the clang code in the preprocessor adds the "./" search path if needed. The code points for these characters have to be chosen carefully so we don't mix charsets when constructing file names. We need to convert the internal UTF-8 to the output charset (e.g. EBCDIC) before trying to open files or displaying their names in error messages. The easiest solution for this is to store the file names in the source manager in the output charset. That will avoid the need to translate the names every time they are used.</font><br><font size="2"><br>On an EBCDIC system, the messages will be in EBCDIC, not UTF-8. All output is expected to be in EBCDIC. We can't assume UTF-8 is used everywhere. </font><br><br><font size="2">I had thought that the IR generator was the primary spot where we would have to do translation. After prototyping and looking at clang, I found that most of the places that would need translation were in clang itself. Clang generates things like messages, dumps, secondary output files (e.g. make depend), etc. These all need to translate symbol and file names before generating the output. As well, there are all of the tools that use the clang AST. If the clang AST stored names in UTF-8, these would need to translate them too. I think the easiest design is to store the symbol names and the file names in the correct charset and not force the consumer to do a conversion. 
</font><br><font size="2"><br>--<br>Sean Perry<br>Compiler Development<br>IBM Canada Lab<br>(905)-413-6031 (tie 313-6031), fax (905)-413-4839<br></font><br><br><font size="2" color="#5F5F5F">From: </font><font size="2">"Friedman, Eli" &lt;efriedma@codeaurora.org&gt;</font><br><font size="2" color="#5F5F5F">To: </font><font size="2">Sean Perry &lt;perry@ca.ibm.com&gt;, cfe-dev@lists.llvm.org</font><br><font size="2" color="#5F5F5F">Date: </font><font size="2">01/29/18 10:19 PM</font><br><font size="2" color="#5F5F5F">Subject: </font><font size="2">Re: [cfe-dev] Implementing charsets (-fexec-charset &amp; -finput-charset)</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br>On 1/29/2018 5:21 PM, Sean Perry via cfe-dev wrote:
<ul><ul><font size="2">Hi, I have been investigating how to implement -finput-charset and -fexec-charset (including -fexec-wide-charset too). I noticed some discussions a couple of years ago and was planning on taking the same approach. In a nutshell, change the source manager to convert the input charset into UTF-8 and do the parsing using UTF-8 (e.g. in Lexer::InitLexer()). I would convert strings and character constants into the exec charset when a consumer asks for the string literal. This seems like a sound concept but there are many details that need to be ironed out. The clang data structure is filled with all kinds of strings (e.g. file names, identifiers, literals). What charset should be used when creating the clang ASTs? Should getName() return the name in UTF-8 or an output charset? </font></ul></ul><br>UTF-8; introducing another charset into the AST seems confusing for no benefit, given we're converting the input source code to UTF-8 anyway. The only place we need to translate symbol names to the execution charset is IR generation.<br>
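<br>To make the "convert to UTF-8 first, then classify" idea concrete, here is a minimal sketch, not clang code. The function names are hypothetical and the conversion covers only the uppercase letters, whose EBCDIC positions (0xC1-0xC9, 0xD1-0xD9, 0xE2-0xE9) are the same in the common EBCDIC code pages. Numeric code units are used on purpose, so the result does not depend on the charset of the compiler that builds the sketch.<br>

```cpp
#include <cassert>
#include <cstdint>

// Naive range check applied directly to an EBCDIC byte. 'A' is 0xC1 and
// 'Z' is 0xE9, but the letters are not contiguous, so this also accepts
// bytes such as 0xCA that are not letters at all.
bool isUpperNaiveEbcdic(std::uint8_t ch) {
  return ch >= 0xC1 && ch <= 0xE9;
}

// Hypothetical, partial EBCDIC -> UTF-8 conversion: uppercase letters only.
// Returns 0 for every byte this sketch does not handle.
std::uint8_t ebcdicUpperToUtf8(std::uint8_t ch) {
  if (ch >= 0xC1 && ch <= 0xC9)  // A-I
    return static_cast<std::uint8_t>(0x41 + (ch - 0xC1));
  if (ch >= 0xD1 && ch <= 0xD9)  // J-R
    return static_cast<std::uint8_t>(0x4A + (ch - 0xD1));
  if (ch >= 0xE2 && ch <= 0xE9)  // S-Z
    return static_cast<std::uint8_t>(0x53 + (ch - 0xE2));
  return 0;
}

// Classify after conversion: in UTF-8 the range 'A'..'Z' (0x41..0x5A)
// really is contiguous, so the simple range test is correct here.
bool isUpperViaUtf8(std::uint8_t ebcdic) {
  std::uint8_t u = ebcdicUpperToUtf8(ebcdic);
  return u >= 0x41 && u <= 0x5A;
}

// Small self-check of the behaviour described above.
void checkSketch() {
  assert(isUpperNaiveEbcdic(0xCA));   // false positive on raw EBCDIC
  assert(!isUpperViaUtf8(0xCA));      // rejected once converted
  assert(isUpperViaUtf8(0xC1));       // EBCDIC 'A'
  assert(isUpperViaUtf8(0xE9));       // EBCDIC 'Z'
}
```

<br>Calling checkSketch() exercises both paths: the raw range test accepts the non-letter byte 0xCA, while the convert-then-compare version does not. A real implementation would of course use a full conversion table for the selected -finput-charset rather than this three-range shortcut.<br>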
<ul><ul><font size="2"><br>While looking into this I realized that we need one more charset. We have the input charset for the source code and the exec charset for the string literals. And we have an internal charset (UTF-8) for parsing. But we also need to have a charset for things like symbol names and file names.</font></ul></ul><br>The charset for symbol names has to be the same as -fexec-charset for any target that has an API like dlsym().<br><br>File names do use a different charset, but LLVM has a file system layer which abstracts that, so clang should pretend the filesystem is UTF-8. (On Windows, we convert to UTF-16 before we call into the OS. On other systems, we currently just assume everything is UTF-8, but we could change that if you need to run the compiler on a system where that doesn't hold.)
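<br>A rough sketch of what "pretend the filesystem is UTF-8 and convert at the boundary" could look like for the "./" case. This is not the LLVM file system layer; utf8ToEbcdicSketch and systemPathFor are hypothetical names, the table covers only '.', '/' and lowercase letters, and the sketch itself is assumed to be compiled on an ASCII-based host so that its plain string literals are UTF-8.<br>

```cpp
#include <cassert>
#include <string>

// Hypothetical, partial UTF-8 -> EBCDIC conversion for the handful of
// characters used in this example. Unhandled bytes become 0x3F (EBCDIC SUB).
std::string utf8ToEbcdicSketch(const std::string &utf8) {
  std::string out;
  for (unsigned char c : utf8) {
    unsigned char e = 0x3F;
    if (c == 0x2E)                    e = 0x4B;                 // '.'
    else if (c == 0x2F)               e = 0x61;                 // '/'
    else if (c >= 0x61 && c <= 0x69)  e = 0x81 + (c - 0x61);    // a-i
    else if (c >= 0x6A && c <= 0x72)  e = 0x91 + (c - 0x6A);    // j-r
    else if (c >= 0x73 && c <= 0x7A)  e = 0xA2 + (c - 0x73);    // s-z
    out.push_back(static_cast<char>(e));
  }
  return out;
}

// Build the whole path in the internal charset (UTF-8) and convert once,
// at the point where the name would be handed to the operating system.
std::string systemPathFor(const std::string &nameUtf8) {
  return utf8ToEbcdicSketch("./" + nameUtf8);
}

// Small self-check: converting at the boundary differs from gluing a
// UTF-8 "./" onto a name that is already in EBCDIC.
void checkPathSketch() {
  std::string good = systemPathFor("foo.h");
  std::string mixed = std::string("./") + utf8ToEbcdicSketch("foo.h");
  assert(good == std::string("\x4B\x61\x86\x96\x96\x4B\x88"));
  assert(mixed != good);  // 0x2E 0x2F are not '.' and '/' in EBCDIC
}
```

<br>The design point is that every piece of the path stays in one charset until a single conversion at the OS boundary, so the "./" prefix can never end up in a different charset than the rest of the name.<br>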
<ul><ul><font size="2">We also need to consider messages. The messages may not be in the same charset as the input charset or the internal charset. We will need to consider translation for messages and the substituted text. </font></ul></ul><br>Messages should be UTF-8 until we have to convert them for output. (An IDE always wants UTF-8. For a console/stderr, we probably need some conversion, but IIRC that isn't implemented at the moment.)<br><br>Preprocessed output needs to be in the input charset; otherwise the compiler can't consume the result (for example, -save-temps would break).<br><br>-Eli<br><tt>-- <br>Employee of Qualcomm Innovation Center, Inc.<br>Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</tt><br><br><BR>
</body></html>