[libcxx-commits] [PATCH] D126971: [libc++] Implements Unicode grapheme clustering

Sat Jul 9 12:23:22 PDT 2022

Mordante marked 9 inline comments as done.
Mordante added a comment.

Thanks a lot for your review!

================
Comment at: libcxx/include/__format/formatter_output.h:283

+namespace __detail {
 template <class _CharT>
----------------
vitaut wrote:
> Why is `__truncate` in the `__detail` namespace while other functions which are also implementation details aren't?
Good point there used to be more in `__detail`, I've removed that inner namespace.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:713
+  size_t __width_;
+  /// The last parsed element.
+  ///
----------------
vitaut wrote:
> Is it the last or one past the last? Seems like the latter. Also "parsed" seems a bit unclear in this context. Are you referring to grapheme clusterization as parsing? I would also recommend using a more descriptive name like `__end_` or `__last_`.
I agree, `__last_` would be a better name and consistent with other parts of the code.
I'm removed the word "parsed".

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:719
+
+/// There is a small issue when estimating the column width.
+/// Since a column width can be two it's possible that the requested column
----------------
vitaut wrote:
> nit: I wouldn't call it an issue. It's how width and precision work and very much expected behavior =).
I agree it's as specified :-) I removed this entire sentence, it didn't add much information not captures in the other comment.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:825
+//
+// The code point to the number of columns isn't well defined. The code uses the
+// estimations defined in [format.string.std]/11. This list might change in the
----------------
vitaut wrote:
> It is well-defined in the same way the Unicode standard we are referring to is well-defined. I suggest saying that it can change instead, particularly with the change to the Unicode standard.
I removed most of this sentence and just mention the specifications in [format.string.std]/11. 

================
Comment at: libcxx/include/__format/unicode.h:73
+
+  _LIBCPP_HIDE_FROM_ABI constexpr char32_t __consume() noexcept {
+    _LIBCPP_ASSERT(__first_ != __last_, "can't move beyond the end of input");
----------------
vitaut wrote:
> As a potential future optimization (not in this diff) you might want to consider branchless code point decoding, e.g. https://github.com/fmtlib/fmt/blob/b761f1279e2dfb950a7bf9ff7128d6b03b77b013/include/fmt/format.h#L571-L588.
This looks very interesting, thanks for the hint!

================
Comment at: libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp:31
+  //*** 2-byte code points ***
+  assert(std::format("{:*^4}", "\u00a1") == "*\u00a1*"); // INVERTED EXCLAMATION MARK
+  assert(std::format("{:*^4}", "\u07ff") == "*\u07ff*"); // NKO TAMAN SIGN
----------------
vitaut wrote:
> I think the output should be "*\u00a1**" because U+00A1 has an estimated width of 1: http://eel.is/c++draft/format#string.std-11.
This test tests the output when using ASCII. When using ASCII the codepoint uses two bytes and thus `*\u00a1*` is correct. In the Unicode test the column width is estimated at one. I left a comment in that file.

================
Comment at: libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp:95
+    //*** 2-byte code points ***
+    assert(std::format(L"{:*^3}", L"\u00a1") == L"*\u00a1*"); // INVERTED EXCLAMATION MARK
+    assert(std::format(L"{:*^3}", L"\u07ff") == L"*\u07ff*"); // NKO TAMAN SIGN
----------------
@vitaut This is the unicode test where the column width is 1.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126971/new/

https://reviews.llvm.org/D126971