[PATCH] D93031: Enable fexec-charset option

Thu Dec 10 15:11:12 PST 2020

rsmith added a comment.

I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of `u8"..."` / `u"..."` / `U"..."` literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?

We should use the original source form of the string literal when pretty-printing a `StringLiteral` or `CharacterLiteral`; there are a bunch of UTF-8 assumptions baked into `StmtPrinter` that will need revisiting. And we'll need to modify the handful of places that put the contents of `StringLiteral`s into diagnostics (`#warning`, `#error`, `static_assert`) and make them use a different `ConversionState`, since our assumption is that diagnostic output should be in UTF-8.

================
Comment at: clang/include/clang/Lex/LiteralTranslator.h:29-31
+  static llvm::StringRef InternalCharset;
+  static llvm::StringRef SystemCharset;
+  static llvm::StringRef ExecCharset;
----------------
It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example.

================
Comment at: clang/include/clang/Lex/LiteralTranslator.h:32
+  static llvm::StringRef ExecCharset;
+  static llvm::StringMap<llvm::CharSetConverter> ExecCharsetTables;
+
----------------
Similarly, use of a global cache here will require you guard it with a mutex.

As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`?

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239
+  if (Translate && Converter) {
+    // ResultChar is either UTF-8 or ASCII literal and can only be converted
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:236
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
Reinterpreting an `unsigned*` as a `char*` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+      void *Pointer = &ResultChar;
+      memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned));
+    }
----------------
What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1321
+  llvm::CharSetConverter *Converter =
+      StringLiteralParser::Translator.getCharConversionTable(TranslationState);
+
----------------
Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?)

================
Comment at: clang/test/Driver/cl-options.c:214
+// RUN: %clang_cl /execution-charset:iso8859-1 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-iso8859-1 %s
+// execution-charset-iso8859-1-NOT: invalid value 'iso8859-1' in '/execution-charset:iso8859-1'
+
----------------
Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.

================
Comment at: clang/test/Driver/clang_f_opts.c:213
+// RUN: %clang -### -S -fexec-charset=iso8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-VALID-INPUT-CHARSET %s
+// CHECK-VALID-INPUT-CHARSET-NOT: error: invalid value 'iso8859-1' in '-fexec-charset=iso8859-1'

----------------
Again, this is not a useful test.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031