[libcxx-commits] [PATCH] D144742: [libc++][format] Improves fill character.

Thu May 18 10:07:17 PDT 2023

Mordante added a comment.

Thanks fro the reviews!

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:451-456
+    __unicode::__consume_result __consumed = __view.__consume();
+    ptrdiff_t __code_units                 = __view.__position() - __begin;
+    if (__view.__position() < __end) {
+      if (__parse_alignment(*__view.__position())) {
+        if (__consumed.__status != __unicode::__consume_result::__ok)
+          std::__throw_format_error("The fill character contains an invalid value");
----------------
tahonermann wrote:
> Mordante wrote:
> > Mordante wrote:
> > > tahonermann wrote:
> > > > This seems a little odd to me. When consumption of the fill character fails (due to an encoding issue), an attempt is still made to parse the alignment at the new position before checking for the consumption error and then reporting a parse issue. I'm not sure why that attempt is made since success is going to lead to reporting the fill character failure and failure is going to result in falling through to retry parsing the alignment at the beginning anyway. If the intent is to get to different error messages, that seems reasonable, but it seems this can fallback to trying to parse the alignment at the beginning anyway. Am I missing something?
> > > I wrote this in this way due to historic reasons. A character like 0 is a fill-character when followed by an alignment. Else it's a zero-padding. Since valid elements of the format-spec are not invalid Unicode this is not needed. I adjusted it, but reverted it again. For UTF-32 it makes sense to only test when it's a fill character, and I like to keep the same diagnostic regardless of the encoding used. So I added comment to explain the design.
> > @tahonermann are you happy with this comment?
> I agree with the goal of keeping the diagnostics aligned, but I don't think this change does that.
> 
> The concern I have is that, when a code point isn't decoded by the call to `__view.__consume()`, the current location in the view will have been bumped by one code unit (based on my reading of the `__consume()` implementation for the `__code_point_view<char>` explicit specialization). There isn't much reason to expect a code point to be successfully decoded at that location. The likely result is that this block isn't entered (because the call to `__parse_alignment(*__view.__position()` above fails to match an alignment character; note that it doesn't even attempt proper decoding) and execution falls through to the `__parse_alignment(*__begin)` below which will likely return false thus leading to `__parse_fill_align()` returning with an indication that the //fill-and-align// option is not present. I haven't looked into what might happen next.
> 
> The UTF-8 and UTF-16 cases are fundamentally different compared to UTF-32 since they are variable length encodings and the latter is a trivial fixed length encoding. I think it is reasonable to throw a distinct error in this case since this problem cannot occur in UTF-32 (a failure to decode a code point at all is subtly different than decoding a code point that is not a UCS scalar value). It is trivially easy to step to the next code unit sequence in UTF-32, but not in UTF-8 or UTF-16.
Good point thanks! I tested with your additional suggested tests and the `fill-and-align` is indeed not properly detected for some of the new test cases. This results in a replacement-field that doesn't end with a '}'. There the error doesn't mention fill either. So having a different error for UTF-32 and UTF-8/16 seems the way forward. (I don't feel like doing more effort to possibly detect an alignment for an error message to be worth the code size.)

================
Comment at: libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp:84
+    check_exception("The fill character contains an invalid value", SV("{:\xf4\x90\x80\x80^}"), 42); // U+110000
+    check_exception("The fill character contains an invalid value", SV("{:\xf4\x90\xbf\xbf^}"), 42); // U+11FFFF
+#ifndef TEST_HAS_NO_WIDE_CHARACTERS
----------------
tahonermann wrote:
> Suggested tests for ill-formed UTF-8 code unit sequences:
>   check_exception("???", SV("{:\x80^}"), 42);         // Trailing code unit with no leading one.
>   check_exception("???", SV("{:\xc0^}"), 42);         // Missing trailing code unit.
>   check_exception("???", SV("{:\xe0\x80^}"), 42);     // Missing trailing code unit.
>   check_exception("???", SV("{:\xf0\x80^}"), 42);     // Missing two trailing code units.
>   check_exception("???", SV("{:\xf0\x80\x80^}"), 42); // Missing trailing code unit.
> 
> 
I tested this before changing the code in the header. The first to result in `The fill character contains an invalid value`, the third `"The fill character contains an invalid value`. (I didn't test the fourth and fifth.)

================
Comment at: libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp:87-90
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xd800^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xdbff^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xdc00^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xddff^}"}, 42);
----------------
tahonermann wrote:
> These all test lone surrogates. Here is a suggested test for reversed surrogates:
>   check_exception("???", std::wstring_view{L"{:\xdc00\xd800^}"}, 42);
This also results in `"The format-spec should consume the input or end with a '}'`

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144742/new/

https://reviews.llvm.org/D144742