[libcxx-commits] [PATCH] D144742: [libc++][format] Improves fill character.

Tom Honermann via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Wed May 17 12:48:46 PDT 2023


tahonermann added inline comments.


================
Comment at: libcxx/include/__format/parser_std_format_spec.h:451-456
+    __unicode::__consume_result __consumed = __view.__consume();
+    ptrdiff_t __code_units                 = __view.__position() - __begin;
+    if (__view.__position() < __end) {
+      if (__parse_alignment(*__view.__position())) {
+        if (__consumed.__status != __unicode::__consume_result::__ok)
+          std::__throw_format_error("The fill character contains an invalid value");
----------------
Mordante wrote:
> Mordante wrote:
> > tahonermann wrote:
> > > This seems a little odd to me. When consumption of the fill character fails (due to an encoding issue), an attempt is still made to parse the alignment at the new position before checking for the consumption error and then reporting a parse issue. I'm not sure why that attempt is made since success is going to lead to reporting the fill character failure and failure is going to result in falling through to retry parsing the alignment at the beginning anyway. If the intent is to get to different error messages, that seems reasonable, but it seems this can fallback to trying to parse the alignment at the beginning anyway. Am I missing something?
> > I wrote this in this way due to historic reasons. A character like 0 is a fill-character when followed by an alignment. Else it's a zero-padding. Since valid elements of the format-spec are not invalid Unicode this is not needed. I adjusted it, but reverted it again. For UTF-32 it makes sense to only test when it's a fill character, and I like to keep the same diagnostic regardless of the encoding used. So I added comment to explain the design.
> @tahonermann are you happy with this comment?
I agree with the goal of keeping the diagnostics aligned, but I don't think this change does that.

The concern I have is that, when a code point isn't decoded by the call to `__view.__consume()`, the current location in the view will have been bumped by one code unit (based on my reading of the `__consume()` implementation for the `__code_point_view<char>` explicit specialization). There isn't much reason to expect a code point to be successfully decoded at that location. The likely result is that this block isn't entered (because the call to `__parse_alignment(*__view.__position()` above fails to match an alignment character; note that it doesn't even attempt proper decoding) and execution falls through to the `__parse_alignment(*__begin)` below which will likely return false thus leading to `__parse_fill_align()` returning with an indication that the //fill-and-align// option is not present. I haven't looked into what might happen next.

The UTF-8 and UTF-16 cases are fundamentally different compared to UTF-32 since they are variable length encodings and the latter is a trivial fixed length encoding. I think it is reasonable to throw a distinct error in this case since this problem cannot occur in UTF-32 (a failure to decode a code point at all is subtly different than decoding a code point that is not a UCS scalar value). It is trivially easy to step to the next code unit sequence in UTF-32, but not in UTF-8 or UTF-16.


================
Comment at: libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp:84
+    check_exception("The fill character contains an invalid value", SV("{:\xf4\x90\x80\x80^}"), 42); // U+110000
+    check_exception("The fill character contains an invalid value", SV("{:\xf4\x90\xbf\xbf^}"), 42); // U+11FFFF
+#ifndef TEST_HAS_NO_WIDE_CHARACTERS
----------------
Suggested tests for ill-formed UTF-8 code unit sequences:
  check_exception("???", SV("{:\x80^}"), 42);         // Trailing code unit with no leading one.
  check_exception("???", SV("{:\xc0^}"), 42);         // Missing trailing code unit.
  check_exception("???", SV("{:\xe0\x80^}"), 42);     // Missing trailing code unit.
  check_exception("???", SV("{:\xf0\x80^}"), 42);     // Missing two trailing code units.
  check_exception("???", SV("{:\xf0\x80\x80^}"), 42); // Missing trailing code unit.




================
Comment at: libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp:87-90
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xd800^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xdbff^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xdc00^}"}, 42);
+    check_exception("The fill character contains an invalid value", std::wstring_view{L"{:\xddff^}"}, 42);
----------------
These all test lone surrogates. Here is a suggested test for reversed surrogates:
  check_exception("???", std::wstring_view{L"{:\xdc00\xd800^}"}, 42);


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144742/new/

https://reviews.llvm.org/D144742



More information about the libcxx-commits mailing list