[libcxx-commits] [PATCH] D143349: [libc++] Fix UTF-8 decoding in codecvts. Fix #60177.

Sun Feb 12 05:48:58 PST 2023

Mordante added a comment.

Sorry for the late review, but I was quite busy last week.
Thanks a lot for fixes. I really like the additional unit tests!

Several minor issues with the patch.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:21
+struct test_offsets_ok {
+  size_t in_size, out_size;
+};
----------------
Please have one declaration per line.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:34
+
+#define array_size(x) (sizeof(x) / sizeof(x)[0])
+
----------------
Can you use `std::array` instead? This is available on all platforms where we support C++03.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:70
+
+  for (auto t : offsets) {
+    InternT out[array_size(exp)] = {};
----------------
What is the difference between this part of the test and the one on line 52? Please add some comments.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:111-115
+      {6, 2, 3, 2}, // no space for third CP
+      {4, 3, 3, 2}, // incomplete third CP
+      {5, 3, 3, 2}, // incomplete third CP
+      {4, 2, 3, 2}, // incomplete third CP, and no space for it
+      {5, 2, 3, 2}, // incomplete third CP, and no space for it
----------------
I think this order of the test improves readability, same for the other tests.

Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small".

Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:148
+template <class InternT, class ExternT>
+void utf8_to_utf32_in_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {
+  // UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP
----------------
I think it would be good to test a few more corner cases in this test.
- input values in the surrogate range (U+D800 to U+DBFF and U+DC00 to U+DFFF)
- outside the valid range > U+10FFFF

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:175
+
+      // replace first trailing byte with invalid byte
+      {3, 4, 1, 1, 0xFF, 2},
----------------
What's the difference between an ASCII byte and an invalid byte?
Both are just invalid due not having the bit pattern `10xxxxxx`, right?

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:706
+  test_offsets_error<InternT> offsets[] = {
+      {5, 10, 0, 0, 0xD800, 0},
+      {5, 10, 0, 0, 0xDBFF, 0},
----------------
Can you make sure all these blocks have comments.
The tests are not to easy to read, without comment I really have hard time to validate the test.

Especially since you use the surrogate values here, are you testing the surrogate values fail, or that the input is malformed in other ways.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143349/new/

https://reviews.llvm.org/D143349