[PATCH] D135366: [clang][Interp] Implement String- and CharacterLiterals

Mon Oct 10 14:06:50 PDT 2022

tahonermann added inline comments.

================
Comment at: clang/lib/AST/Interp/ByteCodeExprGen.cpp:445-446
+    const CharacterLiteral *E) {
+  if (Optional<PrimType> T = classify(E->getType()))
+    return this->emitConst(E, E->getValue());
+
----------------
`T` isn't used and `classify()` already returns an `Optional<PrimType>`, so no conversion coercion appear to be intended here. I think this can be simplified to:
  if (classify(E))
    this->emitConst(E, E->getValue());

================
Comment at: clang/test/AST/Interp/arrays.cpp:138
+  constexpr char32_t c = U'\U0001F60E';
+  static_assert(c == U'😎');
+
----------------
This check depends on the source file being UTF-8 encoded. Perhaps compare against `0x0001F60EL` instead.

================
Comment at: clang/test/AST/Interp/arrays.cpp:143
+
+};
----------------
As others already noted, additional testing of multicharacter literals and UCNs (including named universal characters like `\N{LATIN_CAPITAL_LETTER_E}` would be beneficial. Some tests of character escapes like `\t` wouldn't hurt either.

Clang does not yet support use of `-fexec-charset` to set the literal encoding (execution character set) to anything other than UTF-8 though work on that has been done (see D93031). If such work was completed, it would be useful to run some tests against a non-UTF-8 encoding. Maybe next year.

================
Comment at: clang/test/AST/Interp/arrays.cpp:135
+  static_assert(u32[1] == U'b', "");
+};
+
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > shafik wrote:
> > > tbaeder wrote:
> > > > aaron.ballman wrote:
> > > > > I think you need a more coverage for character literals. Some test cases that are interesting: multichar literals (`'abcd'`), character literals with UCNs (`'\uFFFF'`), character literals with numeric escapes (`'\xFF'`). I'm especially interested in seeing whether we handle integer promotions properly, especially when promoting to `unsigned int`.
> > > > I added two more test cases but I'm generally not that familiar with character  literal edge cases and integer promotion, so if you have concrete test cases in mind, that would be great :)
> > > We can find GNU documentation of multi-char literals here: https://gcc.gnu.org/onlinedocs/cpp/Implementation-defined-behavior.html#Implementation-defined-behavior and I believe we follow the same scheme. 
> > > 
> > > There are some weird cases like `'\8'` which compilers seem to treat consistently but generate a diagnostic for.
> > CC @tahonermann and @cor3ntin as text encoding code owners -- they likely know all the worst test cases to be thinking about in this space.
> Most weirdness are taken care of during lexing.
> What is interesting to test during evaluation
> 
>  * multi character literals ie `'abcd'`
>  * `u8` (utf8) and `u` literals (utf16)
> 
> Note that wide characters literals were removed from C++ so it's no longer a concern
I believe Corentin meant that wide **multi**character literals were removed from C++. That is true for C++23 (they were removed by the adoption of [[ https://wg21.link/P2362R3 | P2362R3 ]] during the July 2022 WG21 plenary) but that change was not adopted as a DR, so they remain part of the language for prior language modes. However, they were a conditionally supported feature that, as of Clang 14, are diagnosed as an error in all language modes. All that to say that I agree; no test for them needed.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135366/new/

https://reviews.llvm.org/D135366