[PATCH] D93031: Enable fexec-charset option

Mon Dec 21 08:25:39 PST 2020

abhina.sreeskantharajan marked 11 inline comments as done.
abhina.sreeskantharajan added inline comments.

================
Comment at: clang/include/clang/Driver/Options.td:3583-3584

+def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<codepage>">,
+  HelpText<"Set the execution <codepage> for string and character literals">;
 def target_cpu : Separate<["-"], "target-cpu">,
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > How about substituting "character set", "character encoding", or "charset" for "codepage"?
> > > 
> > > This doesn't state what names are recognized.  The ones provided by the system iconv() implementation (as is the case for gcc)?  Or all names and aliases specified by the [[ https://www.iana.org/assignments/character-sets/character-sets.xhtml | IANA character set registry ]]?
> > > 
> > > The set of recognized names can be a superset of the names that are actually supported.
> > I've updated the description from codepage to charset.
> > 
> > It's hard to specify what charsets are supported because iconv library differs between targets, so the list will not be the same on every platform.
> Being dependent on the host iconv library seems fine by me; that is the case for gcc today.  I suggest making that explicit here:
>   def fexec_charset : Separate<["-"], "fexec-charset">, MetaVarName<"<charset>">,
>     HelpText<"Set the execution <charset> for string and character literals.  Supported character encodings include XXX and those supported by the host iconv library.">;
I've updated the HelpText with your suggested description.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:192
+                    ConversionState translationState = TranslateToExecCharset);
+  ConversionState TranslationState;

----------------
tahonermann wrote:
> Does the conversion state need to be persisted as a data member?  The literal is consumed in the constructor.
Thanks, I've removed this.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:244
   bool Pascal;
+  ConversionState TranslationState;
+
----------------
tahonermann wrote:
> Same concern here with respect to persisting the conversion state as a data member.
If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?

================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5969-5977
+  // Pass all -fexec-charset options to cc1.
+  std::vector<std::string> vList =
+      Args.getAllArgValues(options::OPT_fexec_charset_EQ);
+  // Set the default fexec-charset as the system charset.
+  CmdArgs.push_back("-fexec-charset");
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > I think it would be preferable to diagnose an unrecognized character encoding name here if possible.  The current changes will result in an unrecognized name (as opposed to one that is unsupported for the target) being diagnosed for each compiler instance.
> > Since we do not know what charsets are supported by the iconv library on the target platform, we don't know what charsets are actually invalid until we try creating a CharSetConverter.
> Understood, but what would be the harm in performing a lookup (constructing a `CharSetConverter`) here?
I initially thought it will be a performance issue if we are creating the Converter twice, once here and once in the Preprocessor. But I do think its a good idea to diagnose this early. I've modified the code to diagnose and error here.

================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:3573
+    StringRef Value = ExecCharset->getValue();
+    Opts.ExecCharset = (std::string)Value;
+  }
----------------
tahonermann wrote:
> abhina.sreeskantharajan wrote:
> > tahonermann wrote:
> > > I wouldn't expect the cast to `std::string` to be needed here.
> > Without that cast, I get the following build error:
> > ```
> >  error: no viable overloaded '='
> > ```
> Ok, rather than a cast, I suggest:
>   Opts.ExecCharset = Value.str();
Thanks, I've applied this change.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239
+  if (Translate && Converter) {
+    // ResultChar is either UTF-8 or ASCII literal and can only be converted
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
tahonermann wrote:
> rsmith wrote:
> > Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct.
> For stateful encodings, I can imagine that state would have to be transitioned to the initial state before translating the escape sequence.  I suspect support for stateful encodings is not a goal at this time.
Right, stateful encodings may be a problem we will need to revisit later as well.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
----------------
tahonermann wrote:
> What should happen if `ResultChar` >= 0x100?  IBM-1047 does have representation for other UTF-8 characters.  Regardless, it seems `ResultChar` should be converted to something.
This is no longer valid, thanks for catching that. We were initially translating to ASCII instead of UTF-8 so we needed to guard against larger characters. I've removed this guard since the internal charset is UTF-8.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:236
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
rsmith wrote:
> Reinterpreting an `unsigned*` as a `char*` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead.
Thanks, I've created a char instead.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+      void *Pointer = &ResultChar;
+      memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned));
+    }
----------------
rsmith wrote:
> What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
This may be a problem we need to revisit since ResultChar is expecting a char.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1742-1745
           EncodeUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd,
                           ResultPtr, hadError,
                           FullSourceLoc(StringToks[i].getLocation(), SM),
                           CharByteWidth, Diags, Features);
----------------
tahonermann wrote:
> UCNs will require conversion here.
I've added code to translate UCN characters and have updated the testcase as well.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031