[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Hubert Tong via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Sat Aug 12 16:16:39 PDT 2023


hubert.reinterpretcast added inline comments.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16899
+          uint32_t CodePoint = static_cast<uint32_t>(V.getInt().getZExtValue());
+          PrintCharLiteralPrefix(BTy->getKind(), OS);
+          OS << '\'';
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > hubert.reinterpretcast wrote:
> > > cor3ntin wrote:
> > > > Looking at the diagnostics, I don't think it makes sense to print a prefix here. You could just leave that part out.
> > > Why is removing the prefix better? The types can matter (characters outside the basic character set are allowed to have negative `char` values). Also, moving forward, the value of a character need not be the same in the various encodings.
> > Some fun with signedness (imagine a more realistic example with `ISO-8859-1` ordinary character encoding with a signed `char` type):
> > ```
> > $ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] == U\'\\uFF10\');'
> > <stdin>:1:15: error: static assertion failed due to requirement 'L"\xFF10"[0] == U'\uff10''
> >     1 | static_assert(L"\uFF10"[0] == U'\uFF10');
> >       |               ^~~~~~~~~~~~~~~~~~~~~~~~~
> > <stdin>:1:28: note: expression evaluates to ''0' (0xFF10) == '0' (0xFF10)'
> >     1 | static_assert(L"\uFF10"[0] == U'\uFF10');
> >       |               ~~~~~~~~~~~~~^~~~~~~~~~~~
> > 1 error generated.
> > Return:  0x01:1   Fri Aug 11 23:49:02 2023 EDT
> > ```
> Either we care about the actual character - ie `'a'`, or it's value (ie `42`). The motivation for the current patch is to add the value to the diagnostic message.
> I'm also concerned about mixing things that are are are not lexical elements in the diagnostics
Maybe the //motivation// for the current patch is to add the value, but what it does (for wide characters as defined in C) is to add the character (and obfuscate the value).

Observe the status quo (https://godbolt.org/z/Wc6nKvTMn):
```
note: expression evaluates to '-240 == 65296'
```

>From the output higher up (with this patch), we see two "identical" characters and values (due to lack of decimal value output). With decimal value output added, it will still be potentially confusing why the two identical characters have different values (without some sort of type annotation).

I admit that the confusion arises in the status quo treatment of `signed char` and `unsigned char`. I hope I am using the word correctly when I say that it is ironic that the patch breaks in one context what it seeks to fix in another.



================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16869
+/// The code point needs to be zero-extended to 32-bits.
+void ConvertCharToString(uint32_t CodePoint, const BuiltinType *BTy,
+                         unsigned TyWidth, llvm::raw_ostream &OS) {
----------------
It does not seem that the first parameter expects a `CodePoint` argument in all cases. For `Char_S`, `Char_U`, and `Char8`, it seems the function wants to treat the input as a UTF-8 code unit.

I suggest changing the argument to be clearly a code unit (and potentially treat it as a code point value as appropriate later in the function).

Also: The function should probably be declared as having static linkage.
Additionally: The function does not "convert" in the language semantic sense. `WriteCharacterValueDescriptionForDisplay` might be a better name.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16876
+  // other types.
+  if (CodePoint <= UCHAR_MAX) {
+    StringRef Escaped = escapeCStyle<EscapeChar::Single>(CodePoint);
----------------
For types other than `Char_S`, `Char_U`, and `Char8`, this fails to treat the C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks like test coverage for these cases are missing.


================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16930-16936
+        case BuiltinType::Char_S:
+        case BuiltinType::Char_U:
+        case BuiltinType::Char8:
+        case BuiltinType::Char16:
+        case BuiltinType::Char32:
+        case BuiltinType::WChar_S:
+        case BuiltinType::WChar_U: {
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > Add FIXME for `char` and `wchar_t` cases that this assumes Unicode literal encodings.
> If we wanted such fixme, it should be L1689.
The `ConvertCharToString` has a first parameter called `CodePoint`. With that interface[^1], it is sensible to insert conversion from the applicable literal encoding to a Unicode code point value here (thus my request for a FIXME here).

You are probably right that the FIXME belongs elsewhere. If you were thinking what I am thinking, then I am guessing you meant L16894? That is where the `ConvertCharToString` function seems to assume that a `wchar_t` value is directly a "code point value". To generate hex escapes, the function needs to be passed the original value (including for `char`s, e.g., to handle stray code units). Once the interface is updated (i.e., the parameter is renamed), the `ConvertCharToString` function would more clearly be the place to put one or more FIXMEs about encoding assumptions.

[^1]: It turns out that the parameter is already not treated consistently as a code point value within the function (and by the caller) and the parameter is just badly named.



================
Comment at: clang/test/SemaCXX/static-assert.cpp:287
+  static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \
+                                                  // expected-note {{evaluates to ''ゆ' (0x3086) == '̵' (0x335)'}}
+  static_assert(L"\/"[1] == u'\xFFFD', ""); // expected-error {{failed}} \
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit.
> > 
> > Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335
> > 
> This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and do not attempt to match std::print. It probably would be an improvement but this patch is not trying to modify how all diagnostics are printed. (all of that logic is in Diagnostic.cpp)
This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing
```
static_assert(false, "\u0301");
```
gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after `: `).

This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155610/new/

https://reviews.llvm.org/D155610



More information about the cfe-commits mailing list