[PATCH] D93031: Enable fexec-charset option
Tom Honermann via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Dec 23 22:04:25 PST 2020
tahonermann added inline comments.
================
Comment at: clang/include/clang/Lex/LiteralSupport.h:244
bool Pascal;
+ ConversionState TranslationState;
+
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > Same concern here with respect to persisting the conversion state as a data member.
> If this member is removed in StringLiteralParser, we will need to pass the State to multiple functions in StringLiteralParser like init(). Would this solution be preferable to keeping a data member?
I think so, yes. Data members should be used to reflect the state of the object, not as a convenient mechanism to avoid passing arguments.
================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:5982-5984
+ } else {
+ D.Diag(clang::diag::err_drv_invalid_value) << "-fexec-charset" << *it;
+ }
----------------
Thank you for adding this.
================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+ SmallString<8> ResultCharConv;
+ Converter->convert(std::string(1, ByteChar), ResultCharConv);
+ memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
----------------
Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence. Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
================
Comment at: clang/lib/Lex/LiteralSupport.cpp:235
+ Converter->convert(std::string(1, ByteChar), ResultCharConv);
+ memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
+ }
----------------
As Richard previously noted, this `memcpy()` needs to be addressed. The intended behavior here is not clear. Are there valid scenarios in which the conversion will produce a sequence of more than one code units? I believe the input is limited to ASCII characters and invalid code units (e.g., a lead byte of a UTF-8 sequence) and in the latter case, an error and/or substitution of a `?` (in the execution encoding) seem like acceptable behaviors to me.
================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1322-1323
+ TranslationState = translationState;
+ if (Kind == tok::wide_string_literal)
+ TranslationState = TranslateToSystemCharset;
+ else if (isUTFLiteral(Kind))
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > Converting wide character literals to the system encoding doesn't seem right to me. For z/OS, this should presumably convert to the wide EBCDIC encoding, but for all other supported platforms, the wide execution character set is either UTF-16 or UTF-32 depending on the size of `wchar_t` (which may be influenced by the `-fshort-wchar` option).
> Since we don't implement -fwide-exec-charset yet, what do you think should be the default behaviour for the interim?
Perhaps an Internal compiler error to indicate that appropriate support is not yet in place?
================
Comment at: clang/test/CodeGen/systemz-charset.c:16-23
+char *EscapeCharacters = "\a\b\f\n\r\t\v\\\'\"\?";
+//CHECK: c"/\16\0C\15\0D\05\0B\E0}\7Fo\00"
+
+char *HexCharacters = "\x12\x13\x14";
+//CHECK: c"\12\13\14\00"
+
+char *OctalCharacters = "\141\142\143";
----------------
`const char*` here too please.
================
Comment at: clang/test/CodeGen/systemz-charset.cpp:1-25
+// RUN: %clang %s -std=c++17 -emit-llvm -S -target s390x-ibm-zos -o - | FileCheck %s
+
+const char *RawString = R"(Hello\n)";
+//CHECK: c"\C8\85\93\93\96\E0\95\00"
+
+char UnicodeChar8 = u8'1';
+//CHECK: i8 49
----------------
This is good. I suggest adding escape sequences and UCNs to validate that they are not converted to IBM-1047.
================
Comment at: clang/test/Driver/clang_f_opts.c:212-213
-// RUN: %clang -### -S -fexec-charset=iso-8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
-// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'iso-8859-1' in '-fexec-charset=iso-8859-1'
+// RUN: %clang -### -S -fexec-charset=invalid-charset -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-INVALID-INPUT-CHARSET %s
+// CHECK-INVALID-INPUT-CHARSET: error: invalid value 'invalid-charset' in '-fexec-charset'
----------------
This looks good. Can tests also be added to validate that the `UTF-8`, `ISO8895-1`, and `IBM-1047` option arguments are properly recognized?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D93031/new/
https://reviews.llvm.org/D93031
More information about the llvm-commits
mailing list