[PATCH] D93031: Enable fexec-charset option

Tue Dec 15 08:40:34 PST 2020

abhina.sreeskantharajan marked 9 inline comments as done.
abhina.sreeskantharajan added a comment.

In D93031#2447230 <https://reviews.llvm.org/D93031#2447230>, @rsmith wrote:

> I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of `u8"..."` / `u"..."` / `U"..."` literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?
>
> We should use the original source form of the string literal when pretty-printing a `StringLiteral` or `CharacterLiteral`; there are a bunch of UTF-8 assumptions baked into `StmtPrinter` that will need revisiting. And we'll need to modify the handful of places that put the contents of `StringLiteral`s into diagnostics (`#warning`, `#error`, `static_assert`) and make them use a different `ConversionState`, since our assumption is that diagnostic output should be in UTF-8.

Yes, these are some of the complications we will need to visit in later patches. We may need to somehow save the original string or reverse the translation.

================
Comment at: clang/include/clang/Driver/Options.td:3583-3584

+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<codepage>">,
+  HelpText<"Set the execution <codepage> for string and character literals">;
 def target_cpu : Separate<["-"], "target-cpu">,
----------------
tahonermann wrote:
> How about substituting "character set", "character encoding", or "charset" for "codepage"?
> 
> This doesn't state what names are recognized.  The ones provided by the system iconv() implementation (as is the case for gcc)?  Or all names and aliases specified by the [[ https://www.iana.org/assignments/character-sets/character-sets.xhtml | IANA character set registry ]]?
> 
> The set of recognized names can be a superset of the names that are actually supported.
I've updated the description from codepage to charset.

It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.

================
Comment at: clang/include/clang/Lex/LiteralTranslator.h:32
+  static llvm::StringRef ExecCharset;
+  static llvm::StringMap<llvm::CharSetConverter> ExecCharsetTables;
+
----------------
rsmith wrote:
> Similarly, use of a global cache here will require you guard it with a mutex.
> 
> As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`?
Thanks, I've added an instance of LiteralTranslator to Preprocessor instead and use that when the Preprocessor is available. There is one constructor of StringLiteralParser that does not pass Preprocessor as an argument, so I had to create a LiteralTranslator instance there as well.

================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977
+  // Pass all -fexec-charset options to cc1.
+  std::vector<std::string> vList =
+      Args.getAllArgValues(options::OPT_fexec_charset_EQ);
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
----------------
tahonermann wrote:
> I think it would be preferable to diagnose an unrecognized character encoding name here if possible.  The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.

================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573
+    StringRef Value = ExecCharset->getValue();
+    Opts.ExecCharset = (std::string)Value;
+  }
----------------
tahonermann wrote:
> I wouldn't expect the cast to `std::string` to be needed here.
Without that cast, I get the following build error:
```
 error: no viable overloaded '='
```

================
Comment at: clang/test/Driver/cl-options.c:214
+// RUN: %clang_cl /execution-charset:iso8859-1 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-iso8859-1 %s
+// execution-charset-iso8859-1-NOT: invalid value 'iso8859-1' in '/execution-charset:iso8859-1'
+
----------------
rsmith wrote:
> Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing.
You're right, I made a change just to make the testcase pass. I think this testcase is no longer needed because fexec-charset should be able to accept all charset names. We won't be able to diagnose invalid charset names until we actually try creating the CharSetConverter.

================
Comment at: llvm/lib/Support/Triple.cpp:1028-1029
+StringRef Triple::getSystemCharset() const {
+  if (getOS() == llvm::Triple::ZOS)
+    return "IBM-1047";
+  return "UTF-8";
----------------
tahonermann wrote:
> No support for targeting the z/OS Enhanced ASCII run-time?
We plan to support both modes in the future, but we want the default to still be IBM-1047 (EBCDIC). 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031