[libcxx-commits] [PATCH] D126971: [libc++] Implements Unicode grapheme clustering

Louis Dionne via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Tue Jul 12 09:38:35 PDT 2022


ldionne added a comment.

I have some comments. Being a total unicode newbie, this is mostly out of my depth, but the general approach looks sensible to me. I'll have a closer look at https://www.unicode.org/reports/tr29/tr29-35.html#Grapheme_Cluster_Boundaries to provide a more meaningful review, but this basically LGTM via the assumption that the test coverage would catch issues in the implementation.



================
Comment at: libcxx/.clang-format:68
 MaxEmptyLinesToKeep: 1
-NamespaceIndentation: Inner
 PackConstructorInitializers: NextLine
----------------
This should be part of a separate patch.


================
Comment at: libcxx/include/__format/parser_std_format_spec.h:902
+}
+#  else
+template <class _CharT>
----------------
`// _LIBCPP_HAS_NO_UNICODE`? This is quite a long `#if` block.


================
Comment at: libcxx/utils/generate_extended_grapheme_cluster_table.py:1
+#!/usr/bin/env python
+# ===----------------------------------------------------------------------===##
----------------
According to what you just said during live review, https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakTest.txt changes from time to time (roughly once a year). Consequently, I think we'll want to make it as easy as possible to update. As discussed, we also don't want to get automatic updates under our feet without manual intervention, because the rules might need to change, etc. Hence, I would suggest:

1. We download the tables that you use to generate headers and tests and check them into libc++ in a reasonable location. Preferably in a subdirectory with a `README.md` that explains what those are and where to download them from.
2. We add the running of the generation scripts to `libcxx-generate-files`.

That way, the workflow for updating those files is simply to re-download the latest tables online, re-generate the headers+tests using `libcxx-generate-files`, and check that in. Furthermore, if we ever forget to re-generate the headers and tests, the CI will tell us through our existing mechanisms.


================
Comment at: libcxx/utils/generate_extended_grapheme_cluster_table.py:97-98
+/// The data is generated from
+/// - https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
+/// - https://www.unicode.org/Public/14.0.0/ucd/emoji/emoji-data.txt
+///
----------------
Those should be in sync!

Also, I would recommend using the latest version of these files, even though the standard uses 12.0 in its bibliography.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126971/new/

https://reviews.llvm.org/D126971



More information about the libcxx-commits mailing list