[PATCH] D93031: Enable fexec-charset option

Fri Dec 11 23:14:58 PST 2020

tahonermann added inline comments.

================
Comment at: clang/include/clang/Driver/Options.td:3583-3584

+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<codepage>">,
+  HelpText<"Set the execution <codepage> for string and character literals">;
 def target_cpu : Separate<["-"], "target-cpu">,
----------------
How about substituting "character set", "character encoding", or "charset" for "codepage"?

This doesn't state what names are recognized.  The ones provided by the system iconv() implementation (as is the case for gcc)?  Or all names and aliases specified by the [[ https://www.iana.org/assignments/character-sets/character-sets.xhtml | IANA character set registry ]]?

The set of recognized names can be a superset of the names that are actually supported.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:192
+                    ConversionState translationState = TranslateToExecCharset);
+  ConversionState TranslationState;

----------------
Does the conversion state need to be persisted as a data member?  The literal is consumed in the constructor.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:244
   bool Pascal;
+  ConversionState TranslationState;
+
----------------
Same concern here with respect to persisting the conversion state as a data member.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:246
+
+  static LiteralTranslator Translator;

----------------
This static data member will presumably need to be lifted to per-instance state as Richard mentioned elsewhere.

================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977
+  // Pass all -fexec-charset options to cc1.
+  std::vector<std::string> vList =
+      Args.getAllArgValues(options::OPT_fexec_charset_EQ);
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
----------------
I think it would be preferable to diagnose an unrecognized character encoding name here if possible.  The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.

================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573
+    StringRef Value = ExecCharset->getValue();
+    Opts.ExecCharset = (std::string)Value;
+  }
----------------
I wouldn't expect the cast to `std::string` to be needed here.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239
+  if (Translate && Converter) {
+    // ResultChar is either UTF-8 or ASCII literal and can only be converted
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
rsmith wrote:
> Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence.  I suspect support for stateful encodings is not a goal at this time.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
----------------
What should happen if `ResultChar` >= 0x100?  IBM-1047 does have representation for other UTF-8 characters.  Regardless, it seems `ResultChar` should be converted to something.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1742-1745
           EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd,
                           ResultPtr, hadError,
                           FullSourceLoc(StringToks[i].getLocation(), SM),
                           CharByteWidth, Diags, Features);
----------------
UCNs will require conversion here.

================
Comment at: llvm/lib/Support/Triple.cpp:1028-1029
+StringRef Triple::getSystemCharset() const {
+  if (getOS() == llvm::Triple::ZOS)
+    return "IBM-1047";
+  return "UTF-8";
----------------
No support for targeting the z/OS Enhanced ASCII run-time?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031