[cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)

via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 29 18:07:30 PST 2018


And I thought Windows was complicated...
Debug info (at least for DWARF) wants strings to be UTF-8; filenames, type/variable names, all that stuff.  Just to throw that out there.
--paulr

From: cfe-dev [mailto:cfe-dev-bounces at lists.llvm.org] On Behalf Of Sean Perry via cfe-dev
Sent: Monday, January 29, 2018 5:22 PM
To: cfe-dev at lists.llvm.org
Subject: [cfe-dev] Implementing charsets (-fexec-charset & -finput-charset)


Hi, I have been investigating how to implement -finput-charset and -fexec-charset (including -fwide-exec-charset). I noticed some discussions from a couple of years ago and was planning on taking the same approach. In a nutshell: change the source manager to convert the input from the input charset into UTF-8 and do the parsing in UTF-8 (e.g. in Lexer::InitLexer()). I would convert string literals and character constants into the exec charset when a consumer asks for the string literal. This seems like a sound concept, but there are many details that need to be ironed out. The clang data structures are filled with all kinds of strings (e.g. file names, identifiers, literals). What charset should be used when creating the clang ASTs? Should getName() return the name in UTF-8 or in an output charset?
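To make the first step concrete, here is a minimal standalone sketch (not Clang code; the helper name convertToUTF8 and the use of POSIX iconv() are just assumptions for illustration) of the re-encoding the source manager / lexer would do: take the raw buffer in the -finput-charset encoding and hand the lexer UTF-8.

#include <iconv.h>
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical helper: re-encode a whole source buffer into UTF-8.
static std::string convertToUTF8(const std::string &Input,
                                 const char *InputCharset) {
  iconv_t CD = iconv_open("UTF-8", InputCharset);
  if (CD == (iconv_t)-1)
    throw std::system_error(errno, std::generic_category(), "iconv_open");

  std::string Out;
  char Buf[4096];
  char *InPtr = const_cast<char *>(Input.data());
  size_t InLeft = Input.size();
  while (InLeft > 0) {
    char *OutPtr = Buf;
    size_t OutLeft = sizeof(Buf);
    size_t Res = iconv(CD, &InPtr, &InLeft, &OutPtr, &OutLeft);
    if (Res == (size_t)-1 && errno != E2BIG) {
      iconv_close(CD);
      throw std::system_error(errno, std::generic_category(), "iconv");
    }
    Out.append(Buf, sizeof(Buf) - OutLeft);
  }
  iconv_close(CD);
  return Out;
}

int main() {
  // "abc" in EBCDIC code page IBM-1047 is 0x81 0x82 0x83.
  std::string EbcdicSource("\x81\x82\x83", 3);
  std::string UTF8Source = convertToUTF8(EbcdicSource, "IBM-1047");
  return UTF8Source == "abc" ? 0 : 1;
}

The idea would be that this conversion happens once when the buffer is loaded, so the lexer itself never needs to know what the input charset was.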

While looking into this I realized that we need one more charset. We have the input charset for the source code and the exec charset for the string literals, and we have an internal charset (UTF-8) for parsing. But we also need a charset for things like symbol names and file names, which do not use the input or internal charsets. For example, on MVS the user may say the input charset is ASCII or UTF-8, but the actual file names and the symbol names in the object file need to be EBCDIC. The same would be true for alternative charsets on Linux: a code point in a locale other than en_US may map to a different code point in the file system. This additional charset is the output charset. It is the charset that symbol names in the object file should use, as well as the charset for file names.
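To illustrate the problem (a tiny standalone example, not Clang code), the same three-character name has completely different byte values in the internal UTF-8 form and in the EBCDIC (IBM-1047) form that MVS tools expect:

#include <cstdio>

int main() {
  const unsigned char UTF8Name[]   = {0x61, 0x62, 0x63}; // "abc" in UTF-8/ASCII
  const unsigned char EbcdicName[] = {0x81, 0x82, 0x83}; // "abc" in IBM-1047
  for (int I = 0; I < 3; ++I)
    std::printf("%02X vs %02X\n", UTF8Name[I], EbcdicName[I]);
  // Writing the UTF-8 bytes into an MVS object file or the file system would
  // produce a name the platform tools cannot match; it has to be re-encoded.
  return 0;
}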

We also need to consider messages. The messages may not be in the same charset as the input or internal charsets. We will need to consider translation of the messages and of the substituted text.

I have thought about some of these issues and would like feedback and/or suggestions.

1) Source file names
- We'd store these in the SourceManager in the output charset.
- When preprocessing (#include, etc.) we would convert the file names into the output charset and do all file name building and system calls using the output charset.

2) Identifiers
- I think the getName() function on IdentifierInfo and similar functions should return the name in the output charset. Too many places, even in clang, use the getName() functions and would need to apply a translation if we didn't do this.
- We need some way to make parsing quick, since identifiers will start off in UTF-8 and we won't be able to use getName() to look up identifiers any more. I was thinking about adding a getNameInternal() that would return the UTF-8 spelling and would be used in the hashing (a rough sketch follows below).
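Here is a rough sketch of the split I have in mind; the class and member names below are hypothetical illustrations, not existing Clang API:

#include <string>
#include <unordered_map>

class IdentifierEntry {
  std::string UTF8Spelling;   // internal form, used as the hash/lookup key
  std::string OutputSpelling; // converted to the output charset

public:
  IdentifierEntry(std::string UTF8, std::string Output)
      : UTF8Spelling(std::move(UTF8)), OutputSpelling(std::move(Output)) {}

  // What most of clang would keep calling; already in the output charset.
  const std::string &getName() const { return OutputSpelling; }

  // New accessor used by the lexer and hash tables, always UTF-8.
  const std::string &getNameInternal() const { return UTF8Spelling; }
};

// The identifier table would be keyed on the UTF-8 spelling, so lexing (which
// happens in UTF-8) never needs a charset conversion to find an entry.
using IdentifierTable = std::unordered_map<std::string, IdentifierEntry>;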

3) String literals & Character constants
- These are converted to the exec charset and stored in the clang data structures in the translated form (sketch below).
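A small standalone sketch of the translation step (the helper name is hypothetical and error handling is elided; POSIX iconv() is used purely for illustration), going from the internal UTF-8 spelling to an IBM-1047 exec charset:

#include <iconv.h>
#include <string>

// Hypothetical helper: re-encode a UTF-8 literal into the exec charset.
static std::string convertFromUTF8(const std::string &UTF8,
                                   const char *ExecCharset) {
  iconv_t CD = iconv_open(ExecCharset, "UTF-8");
  std::string Out(UTF8.size() * 4, '\0');
  char *In = const_cast<char *>(UTF8.data());
  size_t InLeft = UTF8.size();
  char *OutPtr = &Out[0];
  size_t OutLeft = Out.size();
  iconv(CD, &In, &InLeft, &OutPtr, &OutLeft); // error handling elided
  Out.resize(Out.size() - OutLeft);
  iconv_close(CD);
  return Out;
}

int main() {
  // "Hi" is 0x48 0x69 in UTF-8 but 0xC8 0x89 in IBM-1047; the latter is what
  // would be stored for the literal and emitted into the object file.
  std::string Stored = convertFromUTF8("Hi", "IBM-1047");
  return (Stored.size() == 2 && (unsigned char)Stored[0] == 0xC8 &&
          (unsigned char)Stored[1] == 0x89) ? 0 : 1;
}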

4) Messages & trace information
- I am going to assume the charset for messages is a variation of the output charset.
- All substitution text should be in, or converted into, the output charset before generating the diagnostic message.
- Trace/dump output will be in the output charset too.

5) Preprocessed output (including the make dependency rules)
- All preprocessed output will be in the output charset.

Thanks
--
Sean Perry
Compiler Development
IBM Canada Lab
(905)-413-6031 (tie 313-6031), fax (905)-413-4839