[libcxx-commits] [PATCH] D126971: [libc++] Implements Unicode grapheme clustering

Sat Jul 9 08:30:03 PDT 2022

vitaut requested changes to this revision.
vitaut added a comment.
This revision now requires changes to proceed.

Mostly looks good but at least one output doesn't look right which suggests an error in width estimation. Also a few nits inline.

================
Comment at: libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp:9

-#include <array>
-#include <format>
+#  include <array>
+#  include <format>
----------------
Looks like this include is no longer needed since you removed the only std::array usage.

================
Comment at: libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp:245
+    // The benchmark uses a large precision, which forces the formatting
+    // enigine to determine the estimated width. (There's no direct way to call
+    // this function in portable code.)
----------------
enigine -> engine

================
Comment at: libcxx/include/__format/formatter_output.h:283

+namespace __detail {
 template <class _CharT>
----------------
Why is `__truncate` in the `__detail` namespace while other functions which are also implementation details aren't?

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:713
+  size_t __width_;
+  /// The last parsed element.
+  ///
----------------
Is it the last or one past the last? Seems like the latter. Also "parsed" seems a bit unclear in this context. Are you referring to grapheme clusterization as parsing? I would also recommend using a more descriptive name like `__end_` or `__last_`.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:719
+
+/// There is a small issue when estimating the column width.
+/// Since a column width can be two it's possible that the requested column
----------------
nit: I wouldn't call it an issue. It's how width and precision work and very much expected behavior =).

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:797
+    // might be returned. For example when a width of 100 is requested the
+    // returned with might be 99, since the next code point has an estimated
+    // column width of 2. This depends on the rounding flag.
----------------
with -> width

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:825
+//
+// The code point to the number of columns isn't well defined. The code uses the
+// estimations defined in [format.string.std]/11. This list might change in the
----------------
It is well-defined in the same way the Unicode standard we are referring to is well-defined. I suggest saying that it can change instead, particularly with the change to the Unicode standard.

================
Comment at: libcxx/include/__format/unicode.h:73
+
+  _LIBCPP_HIDE_FROM_ABI constexpr char32_t __consume() noexcept {
+    _LIBCPP_ASSERT(__first_ != __last_, "can't move beyond the end of input");
----------------
As a potential future optimization (not in this diff) you might want to consider branchless code point decoding, e.g. https://github.com/fmtlib/fmt/blob/b761f1279e2dfb950a7bf9ff7128d6b03b77b013/include/fmt/format.h#L571-L588.

================
Comment at: libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp:31
+  //*** 2-byte code points ***
+  assert(std::format("{:*^4}", "\u00a1") == "*\u00a1*"); // INVERTED EXCLAMATION MARK
+  assert(std::format("{:*^4}", "\u07ff") == "*\u07ff*"); // NKO TAMAN SIGN
----------------
I think the output should be "*\u00a1**" because U+00A1 has an estimated width of 1: http://eel.is/c++draft/format#string.std-11.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126971/new/

https://reviews.llvm.org/D126971