[libcxx-commits] [PATCH] D143349: [libc++] Fix UTF-8 decoding in codecvts. Fix #60177.

Mon Feb 13 04:16:45 PST 2023

dimztimz added inline comments.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:34
+
+#define array_size(x) (sizeof(x) / sizeof(x)[0])
+
----------------
Mordante wrote:
> Can you use `std::array` instead? This is available on all platforms where we support C++03.
`std::array` is not a good fit in this case for three reasons:

  # There is no inference of size.
  # Does not play as well with string literals.
  # Most importantly, in C++03 the member function `size()` is not constexpr.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:70
+
+  for (auto t : offsets) {
+    InternT out[array_size(exp)] = {};
----------------
Mordante wrote:
> What is the difference between this part of the test and the one on line 52? Please add some comments.
This one calls with the full out-buffer, see bellow `out, std::end(out)`.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:111-115
+      {6, 2, 3, 2}, // no space for third CP
+      {4, 3, 3, 2}, // incomplete third CP
+      {5, 3, 3, 2}, // incomplete third CP
+      {4, 2, 3, 2}, // incomplete third CP, and no space for it
+      {5, 2, 3, 2}, // incomplete third CP, and no space for it
----------------
Mordante wrote:
> I think this order of the test improves readability, same for the other tests.
> 
> Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small".
> 
> Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output.
I think this is subjective, you can give arguments for few different orderings.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:175
+
+      // replace first trailing byte with invalid byte
+      {3, 4, 1, 1, 0xFF, 2},
----------------
Mordante wrote:
> What's the difference between an ASCII byte and an invalid byte?
> Both are just invalid due not having the bit pattern `10xxxxxx`, right?
Well in this test-case there is no difference. But in general, in UTF-8 string if your aim is to fully decode a string then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement char, or reported upwards in the call chain (or some combination of these). the ASCII byte breaks the original sequence but creates a new smaller valid sequence. To reach it, once you receive error, you can push your input pointer by one and do another call to `in()` to check if there is another valid sequence further in the string.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143349/new/

https://reviews.llvm.org/D143349