[libcxx-commits] [PATCH] D126971: [libc++] Implements Unicode grapheme clustering
Victor Zverovich via Phabricator via libcxx-commits
libcxx-commits at lists.llvm.org
Sat Jul 9 08:30:03 PDT 2022
vitaut requested changes to this revision.
vitaut added a comment.
This revision now requires changes to proceed.
Mostly looks good but at least one output doesn't look right which suggests an error in width estimation. Also a few nits inline.
================
Comment at: libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp:9
-#include <array>
-#include <format>
+# include <array>
+# include <format>
----------------
Looks like this include is no longer needed since you removed the only std::array usage.
================
Comment at: libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp:245
+ // The benchmark uses a large precision, which forces the formatting
+ // enigine to determine the estimated width. (There's no direct way to call
+ // this function in portable code.)
----------------
enigine -> engine
================
Comment at: libcxx/include/__format/formatter_output.h:283
+namespace __detail {
template <class _CharT>
----------------
Why is `__truncate` in the `__detail` namespace while other functions which are also implementation details aren't?
================
Comment at: libcxx/include/__format/parser_std_format_spec.h:713
+ size_t __width_;
+ /// The last parsed element.
+ ///
----------------
Is it the last or one past the last? Seems like the latter. Also "parsed" seems a bit unclear in this context. Are you referring to grapheme clusterization as parsing? I would also recommend using a more descriptive name like `__end_` or `__last_`.
================
Comment at: libcxx/include/__format/parser_std_format_spec.h:719
+
+/// There is a small issue when estimating the column width.
+/// Since a column width can be two it's possible that the requested column
----------------
nit: I wouldn't call it an issue. It's how width and precision work and very much expected behavior =).
================
Comment at: libcxx/include/__format/parser_std_format_spec.h:797
+ // might be returned. For example when a width of 100 is requested the
+ // returned with might be 99, since the next code point has an estimated
+ // column width of 2. This depends on the rounding flag.
----------------
with -> width
================
Comment at: libcxx/include/__format/parser_std_format_spec.h:825
+//
+// The code point to the number of columns isn't well defined. The code uses the
+// estimations defined in [format.string.std]/11. This list might change in the
----------------
It is well-defined in the same way the Unicode standard we are referring to is well-defined. I suggest saying that it can change instead, particularly with the change to the Unicode standard.
================
Comment at: libcxx/include/__format/unicode.h:73
+
+ _LIBCPP_HIDE_FROM_ABI constexpr char32_t __consume() noexcept {
+ _LIBCPP_ASSERT(__first_ != __last_, "can't move beyond the end of input");
----------------
As a potential future optimization (not in this diff) you might want to consider branchless code point decoding, e.g. https://github.com/fmtlib/fmt/blob/b761f1279e2dfb950a7bf9ff7128d6b03b77b013/include/fmt/format.h#L571-L588.
================
Comment at: libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp:31
+ //*** 2-byte code points ***
+ assert(std::format("{:*^4}", "\u00a1") == "*\u00a1*"); // INVERTED EXCLAMATION MARK
+ assert(std::format("{:*^4}", "\u07ff") == "*\u07ff*"); // NKO TAMAN SIGN
----------------
I think the output should be "*\u00a1**" because U+00A1 has an estimated width of 1: http://eel.is/c++draft/format#string.std-11.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D126971/new/
https://reviews.llvm.org/D126971
More information about the libcxx-commits
mailing list