[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

Sean Perry via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 29 17:21:39 PST 2018



Hi, I have been investigating how to implement -finput-charset and
-fexec-charset (including -fexec-wide-charset too).  I noticed some
discussions a couple of years ago and was planning on taking the same
approach.  In a nutshell, change the source manager to convert the input
charset into UTF-8 and do the parsing using UTF-8 (e.g. in
Lexer::InitLexer()).  I would convert strings and character constants into
the exec charset when a consumer asks for the string literal.  This seems
like a sound concept but there are many details that need to be ironed out.
The clang data structures are filled with all kinds of strings (e.g. file
names, identifiers, literals).  What charset should be used when creating
the clang ASTs?  Should getName() return the name in UTF-8 or an output
charset?
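
As a rough illustration of the "convert the input buffer to UTF-8 before
lexing" step, here is a minimal sketch.  It assumes ISO-8859-1 as a
stand-in input charset because that conversion needs no tables.  The
helper name latin1ToUTF8 is hypothetical and is not an existing clang
function.  A real implementation would presumably go through a general
conversion library rather than hand-written code like this.

```cpp
#include <string>

// Hypothetical helper (not a clang API): convert an ISO-8859-1 buffer to
// UTF-8 so that the lexer only ever sees UTF-8.
std::string latin1ToUTF8(const std::string &In) {
  std::string Out;
  Out.reserve(In.size() * 2); // worst case: every byte becomes two bytes
  for (unsigned char C : In) {
    if (C < 0x80) {
      // ASCII range is identical in ISO-8859-1 and UTF-8.
      Out.push_back(static_cast<char>(C));
    } else {
      // U+0080..U+00FF encode as two bytes: 110xxxxx 10xxxxxx.
      Out.push_back(static_cast<char>(0xC0 | (C >> 6)));
      Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
    }
  }
  return Out;
}
```

The converted buffer, not the original file contents, would then be what
the lexer is initialized with.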

While looking into this I realized that we need one more charset.  We have
the input charset for the source code and the exec charset for the string
literals.  And we have an internal charset (UTF-8) for parsing.  But we
also need to have a charset for things like symbol names and file names.
These do not use the input or internal charsets.  For example, on MVS the
user may say the input charset is ASCII or UTF-8 but actual file names and
symbol names in the object file need to be EBCDIC.  The same would be true
for alternative charsets on Linux.  A code point in a locale other than
en_US may map to a different code point in the file system.  This fourth
charset is the output charset.  It is the charset that symbol names in
the object file should use, as well as the charset for file names.
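
To make the four roles concrete, here is a small sketch of how they might
be grouped.  Everything in it is an assumption for illustration: the names
CharsetConfig, TextKind and charsetFor are hypothetical, and the charset
name strings are placeholders, not values clang accepts today.

```cpp
#include <string>

// Hypothetical categories of text the compiler handles.
enum class TextKind { SourceText, ParserText, StringLiteral, SymbolName, FileName };

// Hypothetical bundle of the four charsets discussed above.
struct CharsetConfig {
  std::string Input = "UTF-8";    // -finput-charset: encoding of source files
  std::string Internal = "UTF-8"; // what the lexer/parser works in
  std::string Exec = "UTF-8";     // -fexec-charset: encoding of literals
  std::string Output = "UTF-8";   // symbol names and file names
};

// Pick the charset that applies to a given kind of text.
const std::string &charsetFor(const CharsetConfig &C, TextKind K) {
  switch (K) {
  case TextKind::SourceText:
    return C.Input;
  case TextKind::ParserText:
    return C.Internal;
  case TextKind::StringLiteral:
    return C.Exec;
  case TextKind::SymbolName:
  case TextKind::FileName:
    return C.Output;
  }
  return C.Internal; // not reached for valid TextKind values
}
```

For the MVS case, a user might set Input to an ASCII-based charset while
Exec and Output are EBCDIC; Internal would stay UTF-8 regardless.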

We also need to consider messages.  The messages may not be in the same
charset as the input or internal charset.  We will need to consider
translation for both the message text and the text substituted into it.

I have thought about some of these issues and would like feedback and/or
suggestions.

1) Source file names
- We'd store these in the SourceManager in the output charset.
- When preprocessing (#include, etc.) we would convert the file names into
the output charset and do all file name building and system calls using the
output charset.

2) Identifiers
- I think the getName() function on IdentifierInfo and similar functions
should return the name in the output charset.  Too many places, even in
clang, use the getName() functions and would need to apply a translation if
we didn't do this.
- We need some way to make parsing quick since identifiers will start off
in UTF-8 and we won't be able to use getName() to look up identifiers any
more.  I was thinking about adding a getNameInternal() that would return
the UTF-8 spelling and would be used in the hashing.

3) String literals & Character constants
- These are converted to the exec charset and stored in the clang structure
in the translated format.

4) Messages & trace information
- Going to assume the charset for messages is a variation of the output
charset.
- All substitution text should be in or converted into the output charset
before generating the diagnostic message.
- trace/dump output will be in the output charset too.

5) Preprocessed output (including the make dependency rules)
- All preprocessed output will be in the output charset
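
To illustrate point 2, here is a rough sketch of an identifier entry that
is hashed by its UTF-8 spelling but also carries the output-charset name.
IdentEntry and IdentTable are hypothetical stand-ins and do not reflect how
clang's IdentifierInfo or IdentifierTable are actually laid out.  The
converter is passed in as a callback, since the real conversion mechanism
is still an open question.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for IdentifierInfo.
class IdentEntry {
  std::string UTF8Name;   // internal spelling, used for hashing/lookup
  std::string OutputName; // spelling in the output charset
public:
  IdentEntry(std::string UTF8, std::string Output)
      : UTF8Name(std::move(UTF8)), OutputName(std::move(Output)) {}
  // Proposed accessor: the UTF-8 spelling the parser works with.
  const std::string &getNameInternal() const { return UTF8Name; }
  // Existing-style accessor: the name as consumers would see it.
  const std::string &getName() const { return OutputName; }
};

// Hypothetical table keyed on the UTF-8 spelling.
class IdentTable {
  using Converter = std::function<std::string(const std::string &)>;
  Converter ToOutput; // assumed UTF-8 -> output charset conversion
  std::unordered_map<std::string, IdentEntry> Entries;
public:
  explicit IdentTable(Converter C) : ToOutput(std::move(C)) {}

  // Look up by UTF-8 spelling; convert only when first creating the entry.
  const IdentEntry &get(const std::string &UTF8) {
    auto It = Entries.find(UTF8);
    if (It == Entries.end())
      It = Entries.emplace(UTF8, IdentEntry(UTF8, ToOutput(UTF8))).first;
    return It->second;
  }
};
```

With this shape the lexer never pays for a conversion on lookup, and the
translation to the output charset happens once per distinct identifier.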

Thanks
--
Sean Perry
Compiler Development
IBM Canada Lab
(905)-413-6031 (tie 313-6031), fax (905)-413-4839