[libcxx-commits] [PATCH] D143349: [libc++] Fix UTF-8 decoding in codecvts. Fix #60177.

Sun Feb 19 07:00:18 PST 2023

Mordante added a comment.

I like @tahonermann's suggestion to test the edge cases.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:111-115
+      {6, 2, 3, 2}, // no space for third CP
+      {4, 3, 3, 2}, // incomplete third CP
+      {5, 3, 3, 2}, // incomplete third CP
+      {4, 2, 3, 2}, // incomplete third CP, and no space for it
+      {5, 2, 3, 2}, // incomplete third CP, and no space for it
----------------
dimztimz wrote:
> Mordante wrote:
> > I think this order of the test improves readability, same for the other tests.
> > 
> > Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small".
> > 
> > Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output.
> I think this is subjective, you can give arguments for few different orderings.
I agree it's subjective, it's just what feels easier for me.

I'm concerned that the test is not easy to understand, even with the comments.

I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:175
+
+      // replace first trailing byte with invalid byte
+      {3, 4, 1, 1, 0xFF, 2},
----------------
dimztimz wrote:
> Mordante wrote:
> > What's the difference between an ASCII byte and an invalid byte?
> > Both are just invalid due not having the bit pattern `10xxxxxx`, right?
> Well in this test-case there is no difference. But in general, in UTF-8 string if your aim is to fully decode a string then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement char, or reported upwards in the call chain (or some combination of these). the ASCII byte breaks the original sequence but creates a new smaller valid sequence. To reach it, once you receive error, you can push your input pointer by one and do another call to `in()` to check if there is another valid sequence further in the string.
Fair point. I think it would be good to mention the ASCII byte is a valid one code point code unit, since that is what actually matters. The test would give the same result when the code unit was the start of a multibyte code unit, right? (Except then the next code unit might be invalid again.)

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143349/new/

https://reviews.llvm.org/D143349