[cfe-dev] RFC: Enabling fexec-charset support to LLVM and clang

Tue Dec 15 11:03:31 PST 2020

Thanks for your feedback. I hope to follow the current patch with subsequent patches that implement the correct TranslationStates for each context. We used the system charset for the typeinfo name because the C++ ABI specifies a global variable to store the mangled type name, which limits us to a single encoding. If we chose -fexec-charset, we might end up with multiple charsets. Also, some basic typeinfo names (like int and float) are pregenerated in the runtime and will already be in the system charset.

Thanks,
Abhina

-----Richard Smith <richard at metafoo.co.uk> wrote: -----
To: Tom Honermann <Thomas.Honermann at synopsys.com>
From: Richard Smith <richard at metafoo.co.uk>
Date: 12/15/2020 01:59AM
Cc: Abhina Sreeskantharajan <Abhina.Sreeskantharajan at ibm.com>, "cfe-dev at lists.llvm.org" <cfe-dev at lists.llvm.org>
Subject: [EXTERNAL] Re: [cfe-dev] RFC: Enabling fexec-charset support to LLVM and clang

On Mon, 14 Dec 2020 at 20:59, Tom Honermann <Thomas.Honermann at synopsys.com> wrote:

On 12/14/2020 7:40 PM, Richard Smith wrote:

On Mon, 14 Dec 2020 at 08:40, Tom Honermann via cfe-dev <cfe-dev at lists.llvm.org> wrote:

On 12/10/2020 8:59 AM, Abhina Sreeskantharajan via cfe-dev wrote:

We wish to implement the fexec-charset option to enable support for execution character sets other than the default UTF-8. Our proposal is to use UTF-8 as the internal charset. All input source files will be converted to this charset. One side effect is that the file buffer size may change after the translation between single-byte and multi-byte character sets. This Phabricator patch https://reviews.llvm.org/D93031 shows our initial implementation plan.

First, we create converters using the CharSetConverter class (https://reviews.llvm.org/D88741) for conversion between the internal charset and the system or execution charset, and vice versa. The CharSetConverter class provides some conversions itself; for unsupported conversions, it will attempt to create the converter using the iconv library if it exists. Since the iconv library differs between platforms, we cannot guarantee that the behaviour will be the same across all platforms.

Then, during the parsing stage, we translate the string literals using this converter. Translation cannot be performed after this stage because the context is lost: escaped characters can no longer be told apart from normal characters. The translated string will be shown in the human-readable IR.
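
To illustrate why the translation has to happen while escapes are still visible, here is a minimal sketch. It is not the code from the patch: toExecCharset, hexValue and translateLiteralBody are hypothetical names, the mapping covers only ASCII letters as they appear in IBM-1047, and only \x escapes are handled. It assumes an ASCII-based host.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a UTF-8 -> IBM-1047 converter. Only ASCII letters
// are mapped; every other input byte becomes 0x6F (the EBCDIC '?').
unsigned char toExecCharset(unsigned char c) {
  if (c >= 'a' && c <= 'i') return static_cast<unsigned char>(0x81 + (c - 'a'));
  if (c >= 'j' && c <= 'r') return static_cast<unsigned char>(0x91 + (c - 'j'));
  if (c >= 's' && c <= 'z') return static_cast<unsigned char>(0xA2 + (c - 's'));
  if (c >= 'A' && c <= 'I') return static_cast<unsigned char>(0xC1 + (c - 'A'));
  if (c >= 'J' && c <= 'R') return static_cast<unsigned char>(0xD1 + (c - 'J'));
  if (c >= 'S' && c <= 'Z') return static_cast<unsigned char>(0xE2 + (c - 'S'));
  return 0x6F;
}

// Value of one hexadecimal digit; the caller guarantees isxdigit(h).
unsigned hexValue(unsigned char h) {
  if (std::isdigit(h)) return static_cast<unsigned>(h - '0');
  return static_cast<unsigned>(std::tolower(h) - 'a') + 10u;
}

// Translate the spelling between the quotes of a narrow string literal.
// Ordinary characters go through the converter; \x escapes are emitted as
// raw bytes and are never converted. Other escapes are ignored in this sketch.
std::string translateLiteralBody(const std::string &spelling) {
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    if (spelling[i] == '\\' && i + 1 < spelling.size() &&
        spelling[i + 1] == 'x') {
      i += 2;
      unsigned value = 0;
      while (i < spelling.size() &&
             std::isxdigit(static_cast<unsigned char>(spelling[i]))) {
        value = value * 16 + hexValue(static_cast<unsigned char>(spelling[i]));
        ++i;
      }
      out.push_back(static_cast<char>(value & 0xFF)); // escape kept verbatim
    } else {
      out.push_back(static_cast<char>(
          toExecCharset(static_cast<unsigned char>(spelling[i]))));
      ++i;
    }
  }
  return out;
}
```

For the spelling A\x41 this yields the two bytes 0xC1 0x41. Once they sit in the translated buffer, nothing records that the second byte came from an escape, so a later pass could not tell whether it still needs converting.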

In addition, we wish to control translation for different types of strings. In our plan, we introduce an enum, ConversionState, to indicate which output codepage is needed. The table below summarizes what codepage different kinds of string literals should be in:
╔══════════════════════════════════════╤═════════════════════════════════╗
║Context                               │Output Codepage                  ║
╠══════════════════════════════════════╪═════════════════════════════════╣
║asm("...")                            │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║typeinfo name                         │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║#include "fn" or #include <fn>        │system charset (IBM-1047 on z/OS)║
╟──────────────────────────────────────┼─────────────────────────────────╢
║__FILE__ , __func__                   │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║literal, string & char                │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║user-defined literal                  │-fexec-charset                   ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║extern "C" ...                        │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║_Pragma("message(...)")               │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║attribute args (GNU, GCC)             │n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║module map string literals (-fmodules)│n/a (internal charset, UTF-8)    ║
╟──────────────────────────────────────┼─────────────────────────────────╢
║line directive, digit directive       │n/a (internal charset, UTF-8)    ║
╚══════════════════════════════════════╧═════════════════════════════════╝
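
The table can be read as a lookup from context to target charset. The sketch below only mirrors the table as proposed. The proposal names the ConversionState enum but not its members, so the enumerators, the LiteralContext enum and the conversionFor function are all hypothetical.

```cpp
// Hypothetical enumerators: the proposal names ConversionState but does not
// list its members.
enum class ConversionState {
  NoTranslation,   // stay in the internal charset (UTF-8)
  ToSystemCharset, // e.g. IBM-1047 on z/OS
  ToExecCharset    // the charset named by -fexec-charset
};

// Hypothetical enum with one entry per row of the table.
enum class LiteralContext {
  Asm,
  TypeInfoName,
  IncludeFilename,
  PredefinedIdentifier, // __FILE__, __func__
  StringOrCharLiteral,
  UserDefinedLiteral,
  LinkageSpec,          // extern "C" ...
  PragmaMessage,
  AttributeArgument,
  ModuleMapLiteral,
  LineDirective
};

// Returns the output codepage the table assigns to each context.
ConversionState conversionFor(LiteralContext ctx) {
  switch (ctx) {
  case LiteralContext::Asm:
  case LiteralContext::TypeInfoName:
  case LiteralContext::IncludeFilename:
    return ConversionState::ToSystemCharset;
  case LiteralContext::PredefinedIdentifier:
  case LiteralContext::StringOrCharLiteral:
  case LiteralContext::UserDefinedLiteral:
    return ConversionState::ToExecCharset;
  case LiteralContext::LinkageSpec:
  case LiteralContext::PragmaMessage:
  case LiteralContext::AttributeArgument:
  case LiteralContext::ModuleMapLiteral:
  case LiteralContext::LineDirective:
    return ConversionState::NoTranslation;
  }
  return ConversionState::NoTranslation; // unreachable for valid input
}
```

Keeping the decision in one function like this would make it easy to change a single row, for example the typeinfo name, without touching the call sites.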

The text provided by std::type_info::name() is generally displayed in some way, often in conjunction with additional text encoded with the execution encoding. I think this should follow -fexec-charset; that would be consistent with handling of __func__.

The string returned here is assumed to be a mangled type name under both the Itanium ABI and the MS ABI, and is often passed by programs to the demangler / undecorator, so I think it should follow the encoding assumption of the mangling / name decoration scheme, which is de facto UTF-8 for Itanium, and appears to also be UTF-8 under the MS ABI (even if the execution character set is set to something else: https://godbolt.org/z/3Gx646).

Similarly for the predefined string variables: __FUNCDNAME__ should be encoded in UTF-8 because it's a mangled name. The rest are human-readable text, so should be in the execution character set (wide execution character set for L__FUNCTION__ and L__FUNCSIG__).

I suspect UTF-8 wouldn't be an appropriate choice (not in general anyway) for the z/OS GOFF format, so perhaps the right answer for these cases is not to attempt to constrain beyond what C and C++ require: an NTBS. The contents therefore have a target/ABI dependent encoding (if they have one at all).

Yes, that's a good point. We can't necessarily assume that there even is a valid encoding for symbols / mangled names on arbitrary (perhaps not in-tree) targets. So we should take whatever sequence of octets comes out of the name mangler, add a trailing 0 byte, and use the result verbatim as the execution-time value. For Itanium and MS ABI that happens to always be a valid UTF-8 encoding, but that's happenstance, and arbitrary binary data (with no embedded 0 bytes, in order to make __FUNCDNAME__ and typeinfo strings usable) should be equally acceptable.
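
A minimal sketch of that rule, treating the mangler output as opaque bytes. makeSymbolNameBytes is a hypothetical helper, not an existing clang function.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: takes the mangler output as opaque bytes, rejects
// embedded 0 bytes, and appends the terminating 0. No transcoding happens.
std::string makeSymbolNameBytes(const std::string &mangled) {
  if (mangled.find('\0') != std::string::npos)
    throw std::invalid_argument("mangled name contains an embedded 0 byte");
  std::string bytes = mangled; // copied verbatim
  bytes.push_back('\0');
  return bytes;
}
```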

The output of the demangler should presumably be in the execution character set, but that's not our concern.

Agreed.

Tom.

The messages provided for #error, static_assert, [[deprecated]], and [[nodiscard]] (following adoption of WG14 N2448 and WG21 P1301) are another special case. Options are to preserve the provided message in the internal encoding or, as mentioned below, to transcode from the execution encoding to the system encoding for diagnostic display. Per WG14 N2563 and WG21 P2246, either approach is acceptable, but the former would improve QoI.

Several complexities arise when string literals are inspected after translation. In these later stages of compilation, there is an underlying assumption that the literals are encoded in UTF-8 when in fact they can be encoded in one of many charsets. Listed below are four instances where this assumption can be found.

1. During printf/scanf format string validation, format specifiers in UTF-8 are searched for but are not found because the string will already be translated.
2. There are certain optimizations after the parsing stage (e.g. in SimplifyLibCalls.cpp, printf("done\n") gets optimized to puts("done") if the string ends in '\n') which will not work if the string is already translated.
3. When generating messages (e.g. pragma message, warnings, errors) the message (usually in UTF-8) may be merged with string literals which may be translated, resulting in a string with mixed encoding.
4. When using VerifyDiagnosticConsumer to verify expected diagnostics, the expected text is in UTF-8 but the message can refer to string constants and literals which are translated and cannot match the expected text.

Currently, we see no way to resolve these complexities other than to reverse the translation, disable this optimization, or stop certain translations when the string is assumed to be encoded in UTF-8. Although reversing the translation may not yield the original string, it can be used to locate format specifiers, which are guaranteed to be correctly identified.
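
The format-string case can be sketched as follows. This is an illustration under stated assumptions rather than the clang implementation: asciiToEbcdic covers only letters, digits, space and '%' as they appear in IBM-1047, all function names are hypothetical, and an ASCII-based host is assumed.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Illustrative subset of an ASCII -> IBM-1047 mapping. Unmapped input
// becomes 0x6F, the EBCDIC '?'.
unsigned char asciiToEbcdic(unsigned char c) {
  if (c >= 'a' && c <= 'i') return static_cast<unsigned char>(0x81 + (c - 'a'));
  if (c >= 'j' && c <= 'r') return static_cast<unsigned char>(0x91 + (c - 'j'));
  if (c >= 's' && c <= 'z') return static_cast<unsigned char>(0xA2 + (c - 's'));
  if (c >= 'A' && c <= 'I') return static_cast<unsigned char>(0xC1 + (c - 'A'));
  if (c >= 'J' && c <= 'R') return static_cast<unsigned char>(0xD1 + (c - 'J'));
  if (c >= 'S' && c <= 'Z') return static_cast<unsigned char>(0xE2 + (c - 'S'));
  if (c >= '0' && c <= '9') return static_cast<unsigned char>(0xF0 + (c - '0'));
  if (c == ' ') return 0x40;
  if (c == '%') return 0x6C;
  return 0x6F;
}

// Forward translation: internal charset -> execution charset.
std::string toExecCharset(const std::string &s) {
  std::string out;
  for (unsigned char c : s)
    out.push_back(static_cast<char>(asciiToEbcdic(c)));
  return out;
}

// Reverse translation. It is lossy: every unmapped byte comes back as '?',
// so the result may differ from the original string.
std::string fromExecCharset(const std::string &s) {
  std::array<char, 256> reverse;
  reverse.fill('?');
  for (int c = 0; c < 128; ++c) {
    unsigned char e = asciiToEbcdic(static_cast<unsigned char>(c));
    if (e != 0x6F)
      reverse[e] = static_cast<char>(c);
  }
  std::string out;
  for (unsigned char e : s)
    out.push_back(reverse[e]);
  return out;
}

// What a UTF-8-assuming format checker effectively does: look for 0x25.
std::size_t countPercent(const std::string &s) {
  std::size_t n = 0;
  for (char c : s)
    if (c == '%')
      ++n;
  return n;
}
```

With the format string "id %d", countPercent(toExecCharset("id %d")) finds no specifier because '%' is now the byte 0x6C, while countPercent(fromExecCharset(toExecCharset("id %d"))) finds it again.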

Any feedback on this implementation is welcome.

Thanks,
Abhina

_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev 