[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Tue Jan 30 08:18:00 PST 2018

clang and llvm aren't performing any conversion right now.  Everything
assumes the input, output and exec charsets are UTF-8.  One user scenario I
am trying to enable is the input charset being EBCDIC for a system where
EBCDIC is the native charset.  Doing this is non-trivial and exposes the
issues I outline below and most likely more (e.g. debug info).

The clang source is filled with code that is equivalent to "ch == 'A'" or
"ch >= 'A' && ch <= 'Z'", where ch is a character from a source file.  These
examples rely on several assumptions.  Some of these are:
- If the code is compiled on an EBCDIC system, the character literals 'A'
and 'Z' will be in EBCDIC.
- The source files are also in EBCDIC.
- The code points from 'A' to 'Z' are contiguous.  That is not the
case in EBCDIC.
We can solve many of the problems by converting the source files into UTF-8
from EBCDIC and changing the comparisons above to "ch == u'A'" and "ch >=
u'A' && ch <= u'Z'".  Hence the internal charset is now really UTF-8.
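
As a rough illustration of the contiguity point, here is a minimal C++
sketch.  The function names are made up for this example and are not clang
APIs.  The byte values are the IBM-1047 positions of the uppercase letters,
and the comparisons use numeric code points so the result does not depend on
the charset of the host that compiles the compiler.

```cpp
#include <cstdint>

// Code-point test: compares against fixed numeric values, so it behaves the
// same whether the compiler itself was built on an ASCII or EBCDIC host.
constexpr bool isUpperCodePoint(std::uint8_t ch) {
  return ch >= 0x41 && ch <= 0x5A; // U+0041 'A' .. U+005A 'Z'
}

// The same test on raw IBM-1047 bytes needs three ranges, because the
// uppercase letters are not contiguous in EBCDIC.
constexpr bool isUpperEbcdic1047(std::uint8_t ch) {
  return (ch >= 0xC1 && ch <= 0xC9) || // A-I
         (ch >= 0xD1 && ch <= 0xD9) || // J-R
         (ch >= 0xE2 && ch <= 0xE9);   // S-Z
}

// What "ch >= 'A' && ch <= 'Z'" turns into when both literals are EBCDIC.
constexpr bool naiveEbcdicRange(std::uint8_t ch) {
  return ch >= 0xC1 && ch <= 0xE9;
}

// 0xE0 is the backslash in IBM-1047: inside the naive range, not a letter.
static_assert(naiveEbcdicRange(0xE0) && !isUpperEbcdic1047(0xE0),
              "naive range accepts a non-letter");
```

Once the buffer has been converted to UTF-8, only the first function is
needed, which is the point of normalizing the input up front.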

With no other changes, the getName() family of functions in the clang AST
will return identifier names in UTF-8.  This doesn't look very pretty or
helpful in error messages or the internal dump functions.  When the input
charset is not from the same family as UTF-8 (i.e. ASCII based) then you
need to perform a lot of conversions to get useful output from the
compiler.  I'm proposing to have the getName() functions return the
declaration name in the output charset so we get intelligible output.

The same issue happens with include file names.  The llvm support library
provides a layer of encapsulation around files but it does not translate
the file names from UTF-8 to the system charset (e.g. EBCDIC - IBM-1047).
It just uses the file names as they are.  In addition, the clang code in
the preprocessor adds the "./" search path if needed.  The code points for
these characters have to be chosen carefully so we don't mix charsets when
constructing file names.  We need to convert the internal UTF-8 to the
output charset (e.g. EBCDIC) before trying to open files or displaying them
in error messages.  The easiest solution for this is to store the file
names in the source manager in the output charset.  That will avoid the
need to translate the names every time they are used.
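
To make the "./" concern concrete, here is a minimal sketch, assuming a
hand-written table for a small subset of IBM-1047.  Both toIbm1047() and
systemPath() are hypothetical helpers for this example, not llvm or clang
functions; a real implementation would use a complete conversion table.  The
prefix is added while the name is still UTF-8 and the whole path is converted
once, so the two charsets are never mixed in one string.

```cpp
#include <string>

// Hypothetical helper: map one ASCII-range UTF-8 byte to its IBM-1047 byte.
// Covers only letters, digits, '.', '/', '_' and '-'; returns -1 otherwise.
int toIbm1047(unsigned char c) {
  if (c >= 0x61 && c <= 0x69) return 0x81 + (c - 0x61); // a-i
  if (c >= 0x6A && c <= 0x72) return 0x91 + (c - 0x6A); // j-r
  if (c >= 0x73 && c <= 0x7A) return 0xA2 + (c - 0x73); // s-z
  if (c >= 0x41 && c <= 0x49) return 0xC1 + (c - 0x41); // A-I
  if (c >= 0x4A && c <= 0x52) return 0xD1 + (c - 0x4A); // J-R
  if (c >= 0x53 && c <= 0x5A) return 0xE2 + (c - 0x53); // S-Z
  if (c >= 0x30 && c <= 0x39) return 0xF0 + (c - 0x30); // 0-9
  switch (c) {
  case 0x2E: return 0x4B; // '.'
  case 0x2F: return 0x61; // '/'
  case 0x5F: return 0x6D; // '_'
  case 0x2D: return 0x60; // '-'
  }
  return -1; // outside the subset handled by this sketch
}

// Build the path in the internal charset first, then convert the whole
// string once.  Returns an empty string if any byte is outside the subset.
std::string systemPath(const std::string &utf8Name) {
  // "./" spelled as code points so it is UTF-8 even on an EBCDIC host.
  const std::string internal = std::string("\x2E\x2F") + utf8Name;
  std::string out;
  for (unsigned char c : internal) {
    const int e = toIbm1047(c);
    if (e < 0)
      return std::string();
    out.push_back(static_cast<char>(e));
  }
  return out;
}
```

Whether the converted name is then cached (as proposed above) or recomputed
at each file system call is a separate design choice; the sketch only shows
that the prefix and the name must go through the same conversion.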

On an EBCDIC system, the messages will be in EBCDIC, not UTF-8.  All output
is expected to be in EBCDIC.  We can't assume UTF-8 is used everywhere.

I had thought that the IR generator was the primary spot we would have to
do translation.  After prototyping and looking at clang, I found that most
of the places that would need translation were in clang itself.  Clang
generates things like messages, dumps, secondary output files (e.g. make
depend), etc.  These all need to translate symbol and file names before
generating the output.  As well, there are all of the tools that use the
clang AST.  If the clang AST stored names in UTF-8, these would need to
translate them too.  I think the easiest design is to store the symbol
names and the file names in the correct charset and not force the consumer
to do a conversion.

--
Sean Perry
Compiler Development
IBM Canada Lab
(905)-413-6031 (tie 313-6031), fax (905)-413-4839

From:	"Friedman, Eli" <efriedma at codeaurora.org>
To:	Sean Perry <perry at ca.ibm.com>, cfe-dev at lists.llvm.org
Date:	01/29/18 10:19 PM
Subject:	Re: [cfe-dev] Implementing charsets (-fexec-charset &
            -finput-charset)

On 1/29/2018 5:21 PM, Sean Perry via cfe-dev wrote:

      Hi, I have been investigating how to implement -finput-charset
      and -fexec-charset (including -fexec-wide-charset too). I noticed
      some discussions a couple years ago and was planning on taking the
      same approach. In a nutshell, change the source manager to convert
      the input charset into UTF-8 and do the parsing using UTF-8 (eg. in
      Lexer::InitLexer()). I would convert strings and character constants
      into the exec charset when a consumer asks for the string literal.
      This seems like a sound concept but there are many details that need
      to be ironed out. The clang data structure is filled with all kinds
      of strings (i.e. file names, identifiers, literals). What charset
      should be used when creating the clang ASTs? Should getName() return
      the name in UTF-8 or an output charset?

UTF-8; introducing another charset into the AST seems confusing for no
benefit, given we're converting the input source code to UTF-8 anyway.  The
only place we need to translate symbol names to the execution charset is IR
generation.

      While looking into this I realized that we need one more charset. We
      have the input charset for the source code and exec charset for the
      string literals. And we have an internal charset (UTF-8) for
      parsing. But we also need to have a charset for things like symbol
      names and file names.

The charset for symbol names has to be the same as -fexec-charset for any
target that has an API like dlsym().

File names do use a different charset, but LLVM has a file system layer
which abstracts that, so clang should pretend the filesystem is UTF-8.  (On
Windows, we convert to UTF-16 before we call into the OS.  On other
systems, we currently just assume everything is UTF-8, but we could change
that if you need to run the compiler on a system where that doesn't hold.)

      We also need to consider messages. The messages may not be in the
      same charset as the input charset or internal. We will need to
      consider translation for messages and the substituted text.

Messages should be UTF-8 until we have to convert them for output.  (An IDE
always wants UTF-8.  For a console/stderr, we probably need some
conversion, but IIRC that isn't implemented at the moment.)
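
A minimal sketch of that boundary, assuming a pluggable converter:
DiagnosticSink and its members are hypothetical names for this example, not
clang's diagnostic classes.  Messages stay UTF-8 inside the compiler and the
converter runs only when text is emitted.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink: diagnostics are UTF-8 internally; conversion happens
// only at the point where text leaves the compiler.
class DiagnosticSink {
public:
  using Converter = std::function<std::string(const std::string &)>;

  // An empty converter means the consumer (e.g. an IDE) wants UTF-8 as is.
  explicit DiagnosticSink(Converter toOutput = {})
      : toOutput_(std::move(toOutput)) {}

  // Record one message, converting it if a converter was supplied.
  void report(const std::string &utf8Message) {
    emitted_.push_back(toOutput_ ? toOutput_(utf8Message) : utf8Message);
  }

  const std::vector<std::string> &emitted() const { return emitted_; }

private:
  Converter toOutput_;
  std::vector<std::string> emitted_;
};
```

A console sink on an EBCDIC system would pass a UTF-8 to IBM-1047 converter
here, while an IDE client would pass none.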

Preprocessed output needs to be in the input charset; otherwise the
compiler can't consume the result (for example, -save-temps would break).

-Eli
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project
