[PATCH] D93031: Enable fexec-charset option

Wed Dec 16 08:35:52 PST 2020

tahonermann added inline comments.

================
Comment at: clang/include/clang/Driver/Options.td:3583-3584

+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<codepage>">,
+  HelpText<"Set the execution <codepage> for string and character literals">;
 def target_cpu : Separate<["-"], "target-cpu">,
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > How about substituting "character set", "character encoding", or "charset" for "codepage"?
> > 
> > This doesn't state what names are recognized.  The ones provided by the system iconv() implementation (as is the case for gcc)?  Or all names and aliases specified by the [[ https://www.iana.org/assignments/character-sets/character-sets.xhtml | IANA character set registry ]]?
> > 
> > The set of recognized names can be a superset of the names that are actually supported.
> I've updated the description from codepage to charset.
> 
> It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.
Being dependent on the host iconv library seems fine by me; that is the case for gcc today.  I suggest making that explicit here:
  def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
    HelpText<"Set the execution <charset> for string and character literals.  Supported character encodings include XXX and those supported by the host iconv library.">;

================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977
+  // Pass all -fexec-charset options to cc1.
+  std::vector<std::string> vList =
+      Args.getAllArgValues(options::OPT_fexec_charset_EQ);
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > I think it would be preferable to diagnose an unrecognized character encoding name here if possible.  The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
> Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?

================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573
+    StringRef Value = ExecCharset->getValue();
+    Opts.ExecCharset = (std::string)Value;
+  }
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > I wouldn't expect the cast to `std::string` to be needed here.
> Without that cast, I get the following build error:
> ```
>  error: no viable overloaded '='
> ```
Ok, rather than a cast, I suggest:
  Opts.ExecCharset = Value.str();

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1322-1323
+  TranslationState = translationState;
+  if (Kind == tok::wide_string_literal)
+    TranslationState = TranslateToSystemCharset;
+  else if (isUTFLiteral(Kind))
----------------
Converting wide character literals to the system encoding doesn't seem right to me.  For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1593-1597
+  ConversionState State = TranslationState;
+  if (Kind == tok::wide_string_literal)
+    State = TranslateToSystemCharset;
+  else if (isUTFLiteral(Kind))
+    State = NoTranslation;
----------------
The stored `TranslationState` should not be completely ignored for wide and UTF string literals.  The standard permits things like the following.
  #pragma rigoot L"bozit"
  #pragma rigoot u"bozit"
  _Pragma(L"rigoot bozit")
  _Pragma(u8"rigoot bozit")
For at least the `_Pragma(L"...")` case, the C++ standard [[ http://eel.is/c++draft/cpp.pragma.op | states ]] the `L` is ignored, but it doesn't say anything about other encoding prefixes.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1594-1595
+  ConversionState State = TranslationState;
+  if (Kind == tok::wide_string_literal)
+    State = TranslateToSystemCharset;
+  else if (isUTFLiteral(Kind))
----------------
Converting wide string literals to the system encoding doesn't seem right to me.  For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031