[libcxx-commits] [PATCH] D126971: [libc++] Implements Unicode grapheme clustering

Wed Jul 13 09:28:13 PDT 2022

Mordante marked 7 inline comments as done.
Mordante added a comment.

Thanks for the reviews!

================
Comment at: libcxx/test/libcxx/utilities/format/format.string/format.string.std/extended_grapheme_cluster.pass.cpp:79-81
+    test(data_utf16);
+  else
+    test(data_utf32);
----------------
vitaut wrote:
> Maybe generate only one of `data_utf16` and `data_utf32` if the other is not used? Or are they both used?
They are both used. Some of our supported platforms have `sizeof(wchar_t) == 16`, for example Windows and AIX.

================
Comment at: libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp:55
+  //*** 1-byte code points ***
+  check.template operator()<"{:*^3}">(SV("* *"), SV(" "));
+  check.template operator()<"{:*^3}">(SV("*~*"), SV("~"));
----------------
vitaut wrote:
> The `template operator()<` part make the tests a bit difficult to parse. I would recommend passing the format string as a normal argument and not as a template parameter.
The issue is the string is used as both as `char` and `wchar_t`. In the past there was no other way to do this, at that time an approach like `STATICALLY-WIDEN` failed. However `STATICALLY-WIDEN` now seems to work with all supported compilers. However I used this approach in a lot of tests so I rather "fix" them in one go. So I'll add a todo to my list.

================
Comment at: libcxx/utils/generate_extended_grapheme_cluster_table.py:1
+#!/usr/bin/env python
+# ===----------------------------------------------------------------------===##
----------------
ldionne wrote:
> According to what you just said during live review, https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakTest.txt changes from time to time (roughly once a year). Consequently, I think we'll want to make it as easy as possible to update. As discussed, we also don't want to get automatic updates under our feet without manual intervention, because the rules might need to change, etc. Hence, I would suggest:
> 
> 1. We download the tables that you use to generate headers and tests and check them into libc++ in a reasonable location. Preferably in a subdirectory with a `README.md` that explains what those are and where to download them from.
> 2. We add the running of the generation scripts to `libcxx-generate-files`.
> 
> That way, the workflow for updating those files is simply to re-download the latest tables online, re-generate the headers+tests using `libcxx-generate-files`, and check that in. Furthermore, if we ever forget to re-generate the headers and tests, the CI will tell us through our existing mechanisms.
I've given this some more thought. Instead of making these changes in this commit I will work on a follow-up commit to make these changes. I don't see the easy way to update as a must-have for LLVM-15 since I don't intend to update these tables in the release branch. That way there's less risk for this patch to miss the branching point.

================
Comment at: libcxx/utils/generate_extended_grapheme_cluster_table.py:97-98
+/// The data is generated from
+/// - https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
+/// - https://www.unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
+///
----------------
ldionne wrote:
> Those should be in sync!
> 
> Also, I would recommend using the latest version of these files, even though the standard uses 12.0 in its bibliography.
I actually used 12 for both but forgot to update this link from the Microsoft code.

But since we want to use the latest version I use the Unicode links to the latest version. That way we don't need to update the links when we update the tables.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126971/new/

https://reviews.llvm.org/D126971