[cfe-dev] RFC: Enabling fexec-charset support to LLVM and clang

Tom Honermann via cfe-dev cfe-dev at lists.llvm.org
Mon Dec 14 08:39:50 PST 2020

On 12/10/2020 8:59 AM, Abhina Sreeskantharajan via cfe-dev wrote:

We wish to implement the -fexec-charset option to enable support for execution character sets other than the default UTF-8. Our proposal is to use UTF-8 as the internal charset; all input source files will be converted to it. One side effect is that the file buffer size may change after translation between single- and multi-byte character sets. This Phabricator patch, https://reviews.llvm.org/D93031, shows our initial implementation plan.
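The buffer-size effect can be seen with a short sketch (Python; Latin-1 here is just an illustrative single-byte charset, not one from the patch): the same text occupies a different number of bytes depending on the target charset.

```python
# Byte length of the same text varies with the charset, so converting a
# file buffer between single- and multi-byte charsets can resize it.
text = "café"
utf8 = text.encode("utf-8")      # 'é' takes two bytes in UTF-8
latin1 = text.encode("latin-1")  # 'é' takes one byte in Latin-1
assert len(utf8) == 5
assert len(latin1) == 4
```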

First, we create converters using the CharSetConverter class (https://reviews.llvm.org/D88741) to convert between the internal charset and the system or execution charset, and vice versa. The CharSetConverter class provides some conversions directly; for unsupported conversions, it attempts to create the converter with the iconv library, if one exists. Since iconv implementations differ between platforms, we cannot guarantee that the behaviour will be the same across all platforms.
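The "built-in conversions first, platform library as fallback" pattern can be sketched in Python (the real CharSetConverter is a C++ class with a different API; make_converter is a hypothetical helper, Python's codec registry stands in for iconv, and cp037, a related EBCDIC code page, stands in for IBM-1047):

```python
import codecs

def make_converter(from_cs: str, to_cs: str):
    """Return a bytes->bytes converter, or None if unsupported.

    Mirrors the proposed behaviour in spirit: anything not provided
    directly falls back to whatever the platform's codec library
    (iconv in the real patch) supplies, so availability is
    platform-dependent.
    """
    try:
        codecs.lookup(from_cs)
        codecs.lookup(to_cs)
    except LookupError:
        return None  # no converter available on this platform
    return lambda data: data.decode(from_cs).encode(to_cs)

conv = make_converter("utf-8", "cp037")  # cp037 stands in for IBM-1047
assert conv is not None
assert conv(b"A") == b"\xc1"  # 'A' is 0xC1 in EBCDIC cp037
```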

Then, during the parsing stage, we translate the string literals using this converter. Translation cannot be performed after this stage because the context is lost: there is no longer any way to distinguish escaped characters from ordinary ones. The translated string is shown in the readable IR output.
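Why the escape context matters can be shown concretely (Python; cp037 again stands in for IBM-1047, which encodes the newline control character differently but is not in Python's codec set):

```python
# Escapes must be resolved before transcoding. Once cooked, the two
# source characters '\' 'n' and a real newline are different strings
# with very different encodings in an EBCDIC execution charset.
source_spelling = r"\n"   # the two characters as written in source
cooked = "\n"             # the control character the escape denotes

# The newline byte differs between UTF-8/ASCII and cp037
# (cp037 maps LF to 0x25; IBM-1047 uses 0x15):
assert cooked.encode("cp037") != cooked.encode("utf-8")
# Translating the uncooked spelling yields the EBCDIC bytes for
# '\' and 'n' -- not a newline at all:
assert source_spelling.encode("cp037") != cooked.encode("cp037")
```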

In addition, we wish to control translation for different types of strings. In our plan, we introduce an enum, ConversionState, to indicate which output codepage is needed. The table below summarizes what codepage different kinds of string literals should be in:
Context                                | Output Codepage
---------------------------------------|----------------------------------
asm("...")                             | system charset (IBM-1047 on z/OS)
typeinfo name                          | system charset (IBM-1047 on z/OS)
#include "fn" or #include <fn>         | system charset (IBM-1047 on z/OS)
__FILE__, __func__                     | -fexec-charset
literal, string & char                 | -fexec-charset
user-defined literal                   | -fexec-charset
extern "C" ...                         | n/a (internal charset, UTF-8)
_Pragma("message(...)")                | n/a (internal charset, UTF-8)
attribute args (GNU, GCC)              | n/a (internal charset, UTF-8)
module map string literals (-fmodules) | n/a (internal charset, UTF-8)
line directive, digit directive        | n/a (internal charset, UTF-8)
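The dispatch the table describes can be sketched as follows (Python; the real ConversionState is a C++ enum with members not shown in the post, encode_literal is a hypothetical helper, and cp037 stands in for IBM-1047):

```python
from enum import Enum, auto

class ConversionState(Enum):  # name from the proposal; members illustrative
    SYSTEM_CHARSET = auto()   # asm(), typeinfo names, #include filenames
    EXEC_CHARSET = auto()     # ordinary literals, __FILE__, __func__
    NO_CONVERSION = auto()    # extern "C", _Pragma, attribute args, ...

def encode_literal(text: str, state: ConversionState,
                   system_cs: str = "cp037",
                   exec_cs: str = "cp037") -> bytes:
    # Pick the output codepage per the table above.
    if state is ConversionState.SYSTEM_CHARSET:
        return text.encode(system_cs)
    if state is ConversionState.EXEC_CHARSET:
        return text.encode(exec_cs)
    return text.encode("utf-8")  # internal charset, left untranslated

assert encode_literal("A", ConversionState.NO_CONVERSION) == b"A"
assert encode_literal("A", ConversionState.EXEC_CHARSET) == b"\xc1"
```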

The text provided by std::type_info::name() is generally displayed in some way, often in conjunction with additional text encoded in the execution encoding. I think this should follow -fexec-charset; that would be consistent with the handling of __func__.

The messages provided for #error, static_assert, [[deprecated]], and [[nodiscard]] (following adoption of WG14 N2448 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2448.pdf> and WG21 P1301 <https://wg21.link/p1301>) are another special case. Options are to preserve the provided message in the internal encoding or, as mentioned below, to transcode from the execution encoding to the system encoding for diagnostic display. Per WG14 N2563 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2563.pdf> and WG21 P2246 <https://wg21.link/p2246>, either approach is acceptable, but the former would improve QoI.

Several complexities arise when string literals are inspected after translation. In these later stages of compilation, there is an underlying assumption that the literals are encoded in UTF-8, when in fact they can be encoded in one of many charsets. Listed below are four instances where this assumption can be found.

1. During printf/scanf format string validation, format specifiers are searched for as UTF-8 byte sequences, but they are not found because the string has already been translated.
2. Certain optimizations after the parsing stage (e.g. in SimplifyLibCalls.cpp, printf("done\n") is optimized to puts("done") when the string ends in '\n') will not fire if the string is already translated.
3. When generating messages (e.g. pragma messages, warnings, errors), the message (usually in UTF-8) may be merged with string literals that have been translated, resulting in a string with mixed encodings.
4. When VerifyDiagnosticConsumer is used to verify expected diagnostics, the expected text is in UTF-8, but the message can refer to string constants and literals that have been translated and therefore cannot match the expected text.
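The first of these instances can be demonstrated directly (Python; cp037 stands in for an EBCDIC execution charset such as IBM-1047):

```python
# A format-string validator that scans for the ASCII/UTF-8 bytes of
# "%d" finds nothing once the literal has been translated to EBCDIC,
# where '%' is 0x6C and 'd' is 0x84.
fmt = "value: %d\n"
translated = fmt.encode("cp037")

assert b"%d" in fmt.encode("utf-8")   # found before translation
assert b"%d" not in translated        # invisible after translation
```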

Currently, we see no way to resolve these complexities other than to reverse the translation, disable the affected optimizations, or suppress certain translations where the string is assumed to be encoded in UTF-8. Although reversing the translation may not yield the original string, it can be used to locate format specifiers, which are guaranteed to be correctly identified.
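The reverse-translation idea can be sketched as follows (Python; cp037 again stands in for an EBCDIC execution charset):

```python
# Decoding the translated bytes back through the execution charset
# recovers a view in which specifiers can be located, even though the
# round trip is not guaranteed to reproduce the original string in
# every case.
translated = "count: %d items\n".encode("cp037")
reversed_view = translated.decode("cp037")

assert "%d" in reversed_view
assert reversed_view.index("%d") == 7  # 'count: ' occupies indices 0-6
```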

Any feedback on this implementation is welcome.

