[PATCH] D153914: [clang-cl] Enable concatenation of predefined identifiers

Mon Jul 24 13:27:29 PDT 2023

tahonermann added inline comments.

================
Comment at: clang/lib/Sema/SemaExpr.cpp:2010-2013
+    OS << "R\"EFLPM("
+       << PredefinedExpr::ComputeName(getPredefinedExprKind(Tok.getKind()),
+                                      currentDecl)
+       << ")EFLPM\"";
----------------
RIscRIpt wrote:
> tahonermann wrote:
> > "EFLPM"? I'm assuming this is an attempt to avoid an accidental clash with the computed name. Does this suffice to ensure a clash can't happen? If not, can we do better? Per http://eel.is/c++draft/lex.string#nt:r-char and http://eel.is/c++draft/lex.charset#1, we have a lot of flexibility regarding which characters to use.
> At first I thought you were hinting me to use non-ascii characters for d-char-sequence. However, when I looked closely d-char-sequence is not as flexible as r-chars that you referenced: http://eel.is/c++draft/lex.string#nt:d-char and http://eel.is/c++draft/lex.charset#2
> 
> Here're my thoughts:
> 1. I was afraid that there could be a possibility of a function name contain either quote (`"`) or a backslash (`\`) despite it seemed impossible.
> 2. At first I used `Lexer::Stringify` to make it bullet-proof. I didn't like it, neither did you (reviewers).
> 3. I replaced `Lexer::Stringify` with raw-string-literal (d-char-sequence was chosen as acronym of current function name).
> 4. Current `EFLPM` looks weird and cryptic. And we are limited to [lex.charset.basic] with exceptions (space and earlier chars are not allowed). I think, using a raw-string-literal without d-char-sequence would be enough, because problem could be caused only by two consecutive characters `)"`, neither of which can appear in a function name.
Sorry about that; you are absolutely right that `d-char-sequence` is (much) more constrained than `r-char-sequence`.

For `__FUNCSIG__`, the generated text can include arbitrary text. For example, given this declaration:
  void f(decltype(sizeof("()"))*)
the macro value is:
  void __cdecl f(decltype(sizeof ("()")) *)
which does contain `)"`.

I think we should do more to avoid any possibility of the resulting string being unparseable because the resulting diagnostics will be completely inscrutable. The best way to do that would be to avoid trying to produce a parseable string literal in the first place. Based on that and given that `Sema::ActOnStringLiteral()` immediately uses `StringLiteralParser` to produce an expanded string literal, I think we would be better off moving this expansion there such that these macros can be expanded to already cooked string literals that don't need to be parsed at all. This will mean having to modify the `StringLiteralParser` constructor to accept a `Decl*` pointer that will be passed `getCurLocalScopeDecl()` in `Sema::ActOnStringLiteral()`. `StringLiteralParser` can then assert if it ever encounters one of these function local predefined tokens when it hasn't been provided a corresponding `Decl*` to use with them.

================
Comment at: clang/test/Sema/ms_predefined_expr.cpp:115
+                                                                                                       // expected-warning{{string literal concatenation of predefined identifier 'L__FUNCSIG__' is a Microsoft extension}}
+}
+
----------------
RIscRIpt wrote:
> tahonermann wrote:
> > How about testing some other encoding prefix combinations?
> >   u"" __FUNCTION // Ok; UTF-16 string literal
> >   u"" L__FUNCTION__ // ill-formed.
> > 
> Good idea. I am not sure whether I should specify `-std` in the test command. I'll leave it without, because current default standard is C++17, and I believe it's not going to be decreased.
> 
> And I think there are enough tests of checking values of these macros. So, in tests for encoding I am going to check types.
Thank you for adding those. I think we should check values for a couple of cases as well. Consider a function name that contains `𐀀` (U+10000 LINEAR B SYLLABLE B008 A). In UTF-8, that character requires four code units (F0 90 80 80) while in UTF-16 it requires two (D800 DC00). We'll want to ensure that such characters are properly converted between encodings.
  void test_encoding𐀀() {
    ASSERT_EQ(u"" __FUNCTION__, u"test_encoding𐀀");
  }

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153914/new/

https://reviews.llvm.org/D153914