[libcxx-commits] [PATCH] D144499: [libc++][format] Improves width estimate.

Fri Apr 14 09:54:14 PDT 2023

Mordante marked 8 inline comments as done.
Mordante added a comment.

Thanks for the review!

================
Comment at: libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp:55-57
+  // The first 256 non valid code points
+  for (char32_t c = 0x110000; c <= 0x1000FF; ++c)
+    test(c);
----------------
tahonermann wrote:
> The values don't make sense here; the initial value is larger than the condition value so the loop body is never entered.
The second should have started with 0x11 instead of 0x10. Nice catch. Interesting the compiler didn't issue a diagnostic.

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:82
+  check(SV("*\u3041*"), SV("{:*^4}"), SV("\u3041")); // U+3041 HIRAGANA LETTER SMALL A
+  check(SV("*\ua4d0*"), SV("{:*^3}"), SV("\ua4d0")); // U+A4D0 LISU LETTER BA (P2675 changed this to 1)

----------------
tahonermann wrote:
> The comment seems off here; U+A4D0 had a width of 1 both before and after P2675.
Nice catch! This seems to be a bug in the existing implementation. (The same for the other similar remarks.)

================
Comment at: libcxx/utils/generate_width_estimation_table.py:197-242
+// UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE
+//
+// See Terms of Use <https://www.unicode.org/copyright.html>
+// for definitions of Unicode Inc.'s Data Files and Software.
+//
+// NOTICE TO USER: Carefully read the following legal agreement.
+// BY DOWNLOADING, INSTALLING, COPYING OR OTHERWISE USING UNICODE INC.'S
----------------
tahonermann wrote:
> Is there something we can do to ensure this gets updated with newer Unicode releases? Perhaps pull the copyright notice from somewhere to run a comparison when the table is regenerated?
Not at the moment. I have a reminder in my calendar.

But maybe I can make a periodic GitHub action that downloads the EastAsianWidth.txt file and compares whether it's different.
This file start with
```
# EastAsianWidth-15.0.0.txt
# Date: 2022-05-24, 17:40:20 GMT [KW, LI]
```
which I assume will change with every Unicode release.

I don't want this in the CI and possibly fail the CI since updating the Unicode files might be non-trivial. I don't expect that for this case, but the grapheme clustering might change the rules and thus need changes in the code. (And I'm not sure how Zach's papers will affect what other parts of the Unicode database we need and how stable these rules are.)

@ldionne do you have an opinion? According to the Standard we should be using the latest Unicode Standard.
(This is part of P2736R2 and was voted in during the last plenary.)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D144499/new/

https://reviews.llvm.org/D144499