[cfe-dev] RFC: Enabling fexec-charset support to LLVM and clang

Thu Dec 10 05:59:50 PST 2020

We wish to implement the -fexec-charset option to support execution character sets other than the default UTF-8. Our proposal is to use UTF-8 as the internal charset and to convert all input source files to it. One side effect is that the file buffer size may change when translating between single-byte and multi-byte character sets. The Phabricator patch https://reviews.llvm.org/D93031 shows our initial implementation plan.

First, we create converters using the CharSetConverter class (https://reviews.llvm.org/D88741) for conversion between the internal charset and the system or execution charset, and vice versa. The CharSetConverter class provides some conversions itself; for unsupported conversions, it attempts to create the converter using the iconv library if it is available. Since the iconv library differs between platforms, we cannot guarantee that the behaviour will be the same across all platforms.
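
As a rough illustration of the iconv fallback, the sketch below wraps the POSIX iconv calls in a helper. The name convertWithIconv is hypothetical and is not the CharSetConverter interface from D88741. It only shows why behaviour depends on the platform: iconv_open fails whenever the local iconv does not know the requested charset pair.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper (not part of D88741): converts `input` from charset
// `from` to charset `to` with the platform's iconv. Returns false if the
// platform does not support the pair or the input cannot be converted.
bool convertWithIconv(const std::string &input, const char *from,
                      const char *to, std::string &output) {
  iconv_t cd = iconv_open(to, from);
  if (cd == (iconv_t)-1)
    return false; // this platform's iconv does not know the pair

  output.clear();
  char buffer[256];
  // iconv takes a non-const pointer but does not modify the input bytes.
  char *in = const_cast<char *>(input.data());
  size_t inLeft = input.size();
  bool ok = true;

  while (ok && inLeft > 0) {
    char *out = buffer;
    size_t outLeft = sizeof(buffer);
    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    output.append(buffer, sizeof(buffer) - outLeft);
    // E2BIG only means the scratch buffer filled up; keep going.
    if (rc == (size_t)-1 && errno != E2BIG)
      ok = false;
  }

  if (ok) {
    // Flush any pending shift sequence for stateful encodings.
    char *out = buffer;
    size_t outLeft = sizeof(buffer);
    if (iconv(cd, nullptr, nullptr, &out, &outLeft) == (size_t)-1)
      ok = false;
    else
      output.append(buffer, sizeof(buffer) - outLeft);
  }

  iconv_close(cd);
  return ok;
}
```

A caller would try its built-in conversions first and use a helper like this only as a last resort, treating a false return as "conversion unavailable on this platform".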

Then, during the parsing stage, we translate string literals using this converter. Translation cannot be performed after this stage because the context is lost: once a literal has been processed, escaped characters and normal characters can no longer be told apart. The translated string will appear in the human-readable IR.
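
To make the escape problem concrete, here is a minimal sketch of translating a literal body while the spelling is still available. It assumes that numeric escapes such as \x41 denote raw code unit values and are not translated, while ordinary characters are. The helper names and the tiny ASCII to IBM-1047 table are hypothetical stand-ins for a real converter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for a real converter: maps a few ASCII characters to their
// IBM-1047 (EBCDIC) byte values. Anything else becomes EBCDIC '?' (0x6F).
unsigned char toyToIbm1047(char c) {
  if (c >= 'a' && c <= 'i') return static_cast<unsigned char>(0x81 + (c - 'a'));
  if (c >= 'j' && c <= 'r') return static_cast<unsigned char>(0x91 + (c - 'j'));
  if (c >= 's' && c <= 'z') return static_cast<unsigned char>(0xA2 + (c - 's'));
  if (c >= 'A' && c <= 'I') return static_cast<unsigned char>(0xC1 + (c - 'A'));
  if (c >= 'J' && c <= 'R') return static_cast<unsigned char>(0xD1 + (c - 'J'));
  if (c >= 'S' && c <= 'Z') return static_cast<unsigned char>(0xE2 + (c - 'S'));
  if (c >= '0' && c <= '9') return static_cast<unsigned char>(0xF0 + (c - '0'));
  if (c == ' ') return 0x40;
  if (c == '%') return 0x6C;
  return 0x6F;
}

// Returns the value of a hex digit, or -1 if `c` is not one.
int hexDigitValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Translates the spelling of a literal body. Only two-digit \xHH escapes
// are recognised here (a simplification); their value is copied verbatim.
std::string translateLiteralBody(const std::string &spelling) {
  std::string out;
  for (std::size_t i = 0; i < spelling.size(); ++i) {
    if (spelling[i] == '\\' && i + 3 < spelling.size() &&
        spelling[i + 1] == 'x') {
      int hi = hexDigitValue(spelling[i + 2]);
      int lo = hexDigitValue(spelling[i + 3]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo)); // raw value, untranslated
        i += 3;
        continue;
      }
    }
    out.push_back(static_cast<char>(toyToIbm1047(spelling[i])));
  }
  return out;
}
```

With this sketch, the spelling A\x41 yields the bytes 0xC1 0x41. Had both characters first been reduced to the byte 0x41, a later pass could not know that only the first one should be translated.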

In addition, we wish to control translation for different types of strings. In our plan, we introduce an enum, ConversionState, to indicate which output codepage is needed. The table below summarizes what codepage different kinds of string literals should be in:
╔══════════════════════════════════════╤═════════════════════════════════╗
║Context                               │Output Codepage                  ║
╠══════════════════════════════════════╪═════════════════════════════════╣
║asm("...")                            │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║typeinfo name                         │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║#include "fn" or #include <fn>        │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║__FILE__ , __func__                   │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║literal, string & char                │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║user-defined literal                  │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║extern "C" ...                        │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║_Pragma("message(...)")               │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║attribute args (GNU, GCC)             │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║module map string literals (-fmodules)│n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║line directive, digit directive       │n/a (internal charset, UTF-8)    ║
╚══════════════════════════════════════╧═════════════════════════════════╝
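
The table can be read as a mapping from literal context to conversion state. The sketch below encodes it that way. Only the name ConversionState comes from the plan above; its enumerators, the LiteralContext enum and the conversionFor function are hypothetical names chosen for illustration and may not match D93031.

```cpp
#include <cassert>

// Enumerators are assumptions; only the enum name appears in the proposal.
enum class ConversionState {
  NoConversion,    // keep the internal charset (UTF-8)
  ToSystemCharset, // e.g. IBM-1047 on z/OS
  ToExecCharset    // the charset given by -fexec-charset
};

// Hypothetical classification of the contexts listed in the table.
enum class LiteralContext {
  AsmString,
  TypeInfoName,
  IncludeFileName,
  FileOrFuncName, // __FILE__, __func__
  StringOrCharLiteral,
  UserDefinedLiteral,
  LinkageSpec, // extern "C"
  PragmaMessage,
  AttributeArg,
  ModuleMapString,
  LineDirective
};

// Returns the output codepage choice for each row of the table.
inline ConversionState conversionFor(LiteralContext ctx) {
  switch (ctx) {
  case LiteralContext::AsmString:
  case LiteralContext::TypeInfoName:
  case LiteralContext::IncludeFileName:
    return ConversionState::ToSystemCharset;
  case LiteralContext::FileOrFuncName:
  case LiteralContext::StringOrCharLiteral:
  case LiteralContext::UserDefinedLiteral:
    return ConversionState::ToExecCharset;
  case LiteralContext::LinkageSpec:
  case LiteralContext::PragmaMessage:
  case LiteralContext::AttributeArg:
  case LiteralContext::ModuleMapString:
  case LiteralContext::LineDirective:
    return ConversionState::NoConversion;
  }
  return ConversionState::NoConversion; // unreachable for valid enumerators
}
```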

Several complexities arise when string literals are inspected after translation. In these later stages of compilation, there is an underlying assumption that the literals are encoded in UTF-8 when in fact they can be encoded in one of many charsets. Listed below are four instances where this assumption can be found.

1. During printf/scanf format string validation, format specifiers in UTF-8 are searched for but are not found because the string will already be translated.
2. There are certain optimizations after the parsing stage (e.g. in SimplifyLibCalls.cpp, printf("done\n") gets optimized to puts("done") if the string ends in '\n') which will not work if the string is already translated.
3. When generating messages (e.g. pragma message, warnings, errors) the message (usually in UTF-8) may be merged with string literals which may be translated, resulting in a string with mixed encoding.
4. When using VerifyDiagnosticConsumer to verify expected diagnostics, the expected text is in UTF-8 but the message can refer to string constants and literals which are translated and cannot match the expected text.
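
The first item can be reproduced in a few lines. The sketch below is not clang's format string checker; countPercentAssumingUtf8 is a hypothetical stand-in for any later pass that compares bytes against UTF-8 values. It uses the fact that '%' is 0x25 in UTF-8 but 0x6C in IBM-1047, and 'd' is 0x64 in UTF-8 but 0x84 in IBM-1047.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts '%' the way a UTF-8-assuming pass would: by looking for the
// ASCII/UTF-8 byte value 0x25.
std::size_t countPercentAssumingUtf8(const std::string &bytes) {
  std::size_t count = 0;
  for (unsigned char b : bytes)
    if (b == 0x25)
      ++count;
  return count;
}

// "%d" as spelled in a UTF-8 source file.
const std::string Utf8Format = "%d";
// The same literal after translation to IBM-1047.
const std::string Ibm1047Format = "\x6C\x84";
```

Scanning Utf8Format finds one specifier, while scanning Ibm1047Format finds none, so a check that runs after translation would silently skip validation.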

Currently, we see no way to resolve these complexities other than to reverse the translation, disable the affected optimizations, or suppress certain translations where the string is assumed to be encoded in UTF-8. Although reversing the translation may not yield the original string, it can still be used to locate format specifiers, which are guaranteed to be correctly identified.
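
A minimal sketch of the reverse-translation idea follows, assuming a single-byte execution charset so that byte offsets in the translated and reversed strings line up. The reverse table covers only a handful of IBM-1047 bytes, and both function names are hypothetical. Bytes outside the table map to '?', which shows that the round trip can be lossy while '%' is still located correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy reverse mapping from IBM-1047 to ASCII for a few bytes only.
// Unknown bytes become '?', so the original string is not fully recovered.
char toyFromIbm1047(unsigned char b) {
  if (b == 0x6C) return '%';
  if (b == 0x40) return ' ';
  if (b >= 0x81 && b <= 0x89) return static_cast<char>('a' + (b - 0x81));
  if (b >= 0x91 && b <= 0x99) return static_cast<char>('j' + (b - 0x91));
  if (b >= 0xA2 && b <= 0xA9) return static_cast<char>('s' + (b - 0xA2));
  return '?';
}

// Returns the offsets of every '%' in an already translated literal by
// reversing the translation one byte at a time.
std::vector<std::size_t> findPercentOffsets(const std::string &translated) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < translated.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(translated[i]);
    if (toyFromIbm1047(b) == '%')
      offsets.push_back(i);
  }
  return offsets;
}
```

For the IBM-1047 bytes of "%d %s" (0x6C 0x84 0x40 0x6C 0xA2), this reports offsets 0 and 3, which a format checker could then use to inspect the conversion characters that follow.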

Any feedback on this implementation is welcome.

Thanks,
Abhina