[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Sat Sep 16 04:09:00 PDT 2023

cor3ntin added inline comments.

================
Comment at: clang/test/SemaCXX/static-assert-cxx26.cpp:304
+static_assert('\u{9}' == (char)1, ""); // expected-error {{failed}} \
+                                       // expected-note {{evaluates to ''\t' (0x09, 9) == '<U+0001>' (0x01, 1)'}}
+static_assert((char8_t)-128 == (char8_t)-123, ""); // expected-error {{failed}} \
----------------
tahonermann wrote:
> cor3ntin wrote:
> > tahonermann wrote:
> > > cor3ntin wrote:
> > > > tahonermann wrote:
> > > > > Is the expected note up to date? I don't see code that would generate the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, though non-printable, character, I would expect more `'\u0001'`.
> > > > See elsewhere in the discussion. this formating is pre existing and managed at the DiagnosticEngine level (pushEscapedString). the reason it's not `\u0001` is 1/ to avoid  reusing c++ syntactic elements for something that comes from diagnostics and is not represented as an escaped sequence in source 2/ `\u00011` is unreadable, and `\U000000001` is also not helpful :)
> > > > 
> > > Thanks for the explanation. I'm not sure that I agree with the rationale for (1) though. We're already putting the value in single quotes and representing some values with escapes in many of these cases when the value isn't produced by an escape sequence (or even a character/string literal); why exclude `\uXXXX`? I agree with the rationale for (2); we could use `'\u{1}'` in that case.
> > FYI afaik the notation in clang predates the existence of \u{} by a few years, and follow Unicode notation (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html).
> > Oldest instance seems to be https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db - i followed suite when reworking the generic escaping mechanism all string fed to diagnostics go through.
> > 
> > I don't care about changing the syntax, but i do hope we are consistent. Ultimately what we are trying to do is to designate a unicode codepoint and whether we do it through C++ syntax or not probably does not matter much as long as it's clear, delimited and consistent!
> I think the substitution of `<U+XXXX>` by the diagnostic engine itself is perfectly fine and good; particularly when it has no context to suggest a different presentation. In this particular case, where the character is being presented using C++ syntax as a character literal, I would prefer that C++ syntax be used consistently.
> 
> From an implementation standpoint, I'm suggesting that `WriteCharValueForDiagnostic()` be modified such that, if `escapeCStyle<EscapeChar::Single>()` returns an empty string, that the character be presented in `'\u{XXXX}'` form if the character is one that would otherwise be substituted by the diagnostic engine (e.g., if `isPrintable()` is false). Note that this would be restricted to `char` values <= 0x7F; larger values could still be passed through as invalid code units that the diagnostic engine would then render as, e.g., `'<FC>'`.
We should take a decision before forcing the author to do further change as to avoid going in circle.
As a user `\u{XXXX}` vs `<U+XXXX>` makes no difference in terms of the amount of information i receive.

I'm really not fan of duplicating code, spreading the logic in multiple places and having multiple ways to render a an invalid `char`. But I'm also concerned about spending too much time on `char` literals in `static_assert`. It might not be a common enough use case to warrant that much scrutiny :)

So I would be happy to go with _any_ direction

================
Comment at: clang/test/SemaCXX/static-assert.cpp:287
+  static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \
+                                                  // expected-note {{evaluates to ''ゆ' (0x3086) == '̵' (0x335)'}}
+  static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \
----------------
hubert.reinterpretcast wrote:
> tahonermann wrote:
> > cor3ntin wrote:
> > > hubert.reinterpretcast wrote:
> > > > hubert.reinterpretcast wrote:
> > > > > cor3ntin wrote:
> > > > > > hubert.reinterpretcast wrote:
> > > > > > > The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit.
> > > > > > > 
> > > > > > > Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335
> > > > > > > 
> > > > > > This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and do not attempt to match std::print. It probably would be an improvement but this patch is not trying to modify how all diagnostics are printed. (all of that logic is in Diagnostic.cpp)
> > > > > This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing
> > > > > ```
> > > > > static_assert(false, "\u0301");
> > > > > ```
> > > > > gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after `: `).
> > > > > 
> > > > > This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).
> > > > @cor3ntin, do you have status quo examples for how grapheme-extending characters that are not already "problematic" in their original context are emitted in diagnostics in contexts where they are?
> > > Are you looking for that sort of examples? https://godbolt.org/z/c79xWr7Me
> > > That shows that clang has no understanding of graphemes
> > gcc and MSVC get that case "right" (probably by accident). https://godbolt.org/z/Tjd6xnEon
> I was more looking for cases where the output grapheme includes elements that were part of the fixed message text (gobbling quotes, etc.). Also, this patch is in the wrong shape for handling this concern in the diagnostic printer because the delimiting of the replacement text happens in this patch.
Could you craft a message that becomes a graphene after substitution by the engine? Maybe? You would have to try very hard and `static_assert` diagnostics are not of the right shape.

> This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

I still don't see it. None of the output produce in this patch are even close to what i could be problematic. ie this patch is only ever producing ASCII or single codepoints that gets escaped when they are not printable

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155610/new/

https://reviews.llvm.org/D155610