[PATCH] D93031: Enable fexec-charset option

Richard Smith - zygoloid via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 2 12:12:36 PST 2021


rsmith added inline comments.


================
Comment at: clang/include/clang/Lex/Preprocessor.h:145
   ModuleLoader      &TheModuleLoader;
+  LiteralConverter  LT;
 
----------------
Please give this a longer name. Abbreviation names should only be used in fairly small scopes where it's easy to look up what they refer to.

Also: why `LT`? What does the `T` stand for?


================
Comment at: clang/lib/Driver/ToolChains/Clang.cpp:6178
+  CmdArgs.push_back(Args.MakeArgString(Triple.getSystemCharset()));
+  for (auto it = vList.begin(), ie = vList.end(); it != ie; ++it) {
+    llvm::ErrorOr<llvm::CharSetConverter> ErrorOrConverter =
----------------
Looping over all the arguments is a little unusual. Normally we'd get the last argument value and only check that one. Do you need to pass more than one value onto the frontend?


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:237
+    SmallString<8> ResultCharConv;
+    Converter->convert(std::string(1, ByteChar), ResultCharConv);
+    assert(ResultCharConv.size() == 1 &&
----------------
Can you avoid using `std::string` here? Eg, pass a `StringRef`, extending the converter to be able to take one if necessary.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1363-1364
+          if (!HadError && Converter) {
+            assert(Kind != tok::wide_char_constant &&
+                   "Wide character translation not supported");
+            SmallString<1> ConvertedChar;
----------------
Why is this case not possible?


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1367
+            Converter->convert(StringRef((char *)tmp_out_start), ConvertedChar);
+            memmove((void *)tmp_out_start, ConvertedChar.data(), 1);
+          }
----------------
What assurance do we have that 1 output character is correct? I would expect we need to reject with a diagnostic if the character doesn't fit in one converted character.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1700-1702
         // Point into the \n inside the \r\n sequence and operate on the
         // remaining portion of the literal.
         RemainingTokenSpan = AfterCRLF.substr(1);
----------------
Do we need to convert the newline character too?

Perhaps for raw string literals it'd be better to do the normal processing here and then convert the entire string at once?


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:234
+    SmallString<8> ResultCharConv;
+    Converter->convert(std::string(1, ByteChar), ResultCharConv);
+    memcpy((void *)&ResultChar, ResultCharConv.data(), sizeof(unsigned));
----------------
abhina.sreeskantharajan wrote:
> tahonermann wrote:
> > Conversion can fail here, particularly in the scenario corresponding to the default switch case above; `ResultChar` could contain, for example, a lead byte of a UTF-8 sequence.  Something sensible should be done here; either rejecting the code with an error or substituting `?` (in the execution encoding) seems appropriate to me.
> Thanks, I added the substitution with the '?' character for invalid escapes.
This is a regression. Our prior behavior for unknown escapes was to leave the character alone. We should still do that wherever possible -- eg, `\q` should produce `q` -- and take fallback action only if the character is unencodable. Producing a `?` seems unlikely to ever be what anyone wants; producing a hard error would seem preferable.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+      void *Pointer = &ResultChar;
+      memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned));
+    }
----------------
abhina.sreeskantharajan wrote:
> abhina.sreeskantharajan wrote:
> > rsmith wrote:
> > > What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host.
> > This may be a problem we need to revisit since ResultChar is expecting a char.
> I added an assertion for this case where the size of the character increases after translation. I've also removed the memcpy to avoid endianness issues.
Is there any guarantee the assertion will not fail?


================
Comment at: clang/test/Driver/cl-options.c:214
+// RUN: %clang_cl /execution-charset:invalid-charset -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-invalid %s
+// execution-charset-invalid: invalid value 'invalid-charset' in '-fexec-charset'
 //
----------------
Please use the given spelling of the flag in the diagnostic. (You can ask the argument how it was spelled.)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031



More information about the llvm-commits mailing list