[libcxx-commits] [PATCH] D144499: [libc++][format] Improves width estimate.

Wed Apr 12 12:25:14 PDT 2023

tahonermann added a comment.

This looks great. I added comments for some minor issues.

I used the following for cross reference purposes. I didn't validate every range, but I did most of them and didn't find any discrepancies.

- https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEast_Asian_Width%3DFullwidth%3A%5D%5B%3AEast_Asian_Width%3DWide%3A%5D%5B%3ABlock%3DYijing_Hexagram_Symbols%3A%5D%5B%3ABlock%3DMiscellaneous_Symbols_And_Pictographs%3A%5D%5B%3ABlock%3DSupplemental_Symbols_And_Pictographs%3A%5D&ucd=on&g=&i=

================
Comment at: libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp:55-57
+  // The first 256 non valid code points
+  for (char32_t c = 0x110000; c <= 0x1000FF; ++c)
+    test(c);
----------------
The values don't make sense here; the initial value is larger than the condition value so the loop body is never entered.

================
Comment at: libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp:62-63
+  // Entries after the table have a width of 1.
+  static_assert(*(std::end(std::__width_estimation_table::__entries) - 1) == ((0x3c000u << 14) | 16381u),
+                "validate whether the optimizations im __estimated_width are still valid");
+  assert(std::__width_estimation_table::__estimated_width(0x3fffd) == 2);
----------------

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:82
+  check(SV("*\u3041*"), SV("{:*^4}"), SV("\u3041")); // U+3041 HIRAGANA LETTER SMALL A
+  check(SV("*\ua4d0*"), SV("{:*^3}"), SV("\ua4d0")); // U+A4D0 LISU LETTER BA (P2675 changed this to 1)

----------------
The comment seems off here; U+A4D0 had a width of 1 both before and after P2675.

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:94
   check(SV("*\ufe30*"), SV("{:*^4}"), SV("\ufe30")); // PRESENTATION FORM FOR VERTICAL TWO DOT LEADER
-  check(SV("*\ufe6f*"), SV("{:*^4}"), SV("\ufe6f")); // U+FE70 ARABIC FATHATAN ISOLATED FORM
+  check(SV("*\ufe70*"), SV("{:*^3}"), SV("\ufe70")); // U+FE70 ARABIC FATHATAN ISOLATED FORM (P2675 changed this to 1)

----------------
The comment seems off here; U+FE70 had a width of 1 both before and after P2675.

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:144
+  check(SV("***"), SV("{:*^3.1}"), SV("\u3041"));      // U+3041 HIRAGANA LETTER SMALL A
+  check(SV("*\ua4d0*"), SV("{:*^3.1}"), SV("\ua4d0")); // U+A4D0 LISU LETTER BA (P2675 changed this to 1)

----------------
The comment seems off here; U+A4D0 had a width of 1 both before and after P2675.

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:156
+  check(SV("***"), SV("{:*^3.1}"), SV("\ufe30"));      // PRESENTATION FORM FOR VERTICAL TWO DOT LEADER
+  check(SV("*\ufe70*"), SV("{:*^3.1}"), SV("\ufe70")); // U+FE70 ARABIC FATHATAN ISOLATED FORM (P2675 changed this to 1)

----------------
The comment seems off here; U+FE70 had a width of 1 both before and after P2675.

================
Comment at: libcxx/utils/generate_width_estimation_table.py:136-140
+/// The maximum code point that has an estimated width of 2 is U+3FFFD. This
+/// value can be encoded in 18 bits. Thus the upper 3 bits of the code point
+/// always 0. These 3 bits are used to enlarge the offset range. This
+/// optimization reduces the table in Unicode 15 from 184 to 104 entries,
+/// saving 320 bytes.
----------------

================
Comment at: libcxx/utils/generate_width_estimation_table.py:151-153
+/// Values greater than this value may have more than 18 significate bits.
+/// They always have a width of 1. This property makes it possible to store
+/// the table in its compact form.
----------------

================
Comment at: libcxx/utils/generate_width_estimation_table.py:197-242
+// UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE
+//
+// See Terms of Use <https://www.unicode.org/copyright.html>
+// for definitions of Unicode Inc.'s Data Files and Software.
+//
+// NOTICE TO USER: Carefully read the following legal agreement.
+// BY DOWNLOADING, INSTALLING, COPYING OR OTHERWISE USING UNICODE INC.'S
----------------
Is there something we can do to ensure this gets updated with newer Unicode releases? Perhaps pull the copyright notice from somewhere to run a comparison when the table is regenerated?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144499/new/

https://reviews.llvm.org/D144499