[PATCH] D93031: Enable fexec-charset option

Wed Dec 23 22:04:25 PST 2020

tahonermann added inline comments.

================
Comment at: clang/include/clang/Lex/LiteralSupport.h:244
   bool Pascal;
+  ConversionState TranslationState;
+
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > Same concern here with respect to persisting the conversion state as a data member.
> If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
I think so, yes.  Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.

================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5982-5984
+    } else {
+      D.Diag(clang::diag::err_drv_invalid_value) << "-fexec-charset" << *it;
+    }
----------------
Thank you for adding this.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+    SmallString<8> ResultCharConv;
+    Converter->convert(std::string(1, ByteChar), ResultCharConv);
+    memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
----------------
Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence.  Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:235
+    Converter->convert(std::string(1, ByteChar), ResultCharConv);
+    memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
+  }
----------------
As Richard previously noted, this `memcpy()` needs to be addressed.  The intended behavior here is not clear.  Are there valid scenarios in which the conversion will produce a sequence of more than one code units?  I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1322-1323
+  TranslationState = translationState;
+  if (Kind == tok::wide_string_literal)
+    TranslationState = TranslateToSystemCharset;
+  else if (isUTFLiteral(Kind))
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > Converting wide character literals to the system encoding doesn't seem right to me.  For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
> Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?
Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?

================
Comment at: clang/test/CodeGen/systemz-charset.c:16-23
+char *EscapeCharacters = "\a\b\f\n\r\t\v\\\'\"\?";
+//CHECK: c"/\16\0C\15\0D\05\0B\E0}\7Fo\00"
+
+char *HexCharacters = "\x12\x13\x14";
+//CHECK: c"\12\13\14\00"
+
+char *OctalCharacters = "\141\142\143";
----------------
`const char*` here too please.

================
Comment at: clang/test/CodeGen/systemz-charset.cpp:1-25
+// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s
+
+const char *RawString = R"(Hello\n)";
+//CHECK: c"\C8\85\93\93\96\E0\95\00"
+
+char UnicodeChar8 = u8'1';
+//CHECK: i8 49
----------------
This is good.  I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.

================
Comment at: clang/test/Driver/clang_f_opts.c:212-213

-// RUN: %clang -### -S -fexec-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
-// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'iso-8859-1' in '-fexec-charset=iso-8859-1'
+// RUN: %clang -### -S -fexec-charset=invalid-charset -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
+// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'invalid-charset' in '-fexec-charset'

----------------
This looks good.  Can tests also be added to validate that the `UTF-8`, `ISO8895-1`, and `IBM-1047` option arguments are properly recognized?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031