[libcxx-commits] [PATCH] D143349: [libc++] Fix UTF-8 decoding in codecvts. Fix #60177.

Mon Feb 20 13:07:08 PST 2023

dimztimz added a comment.

In D143349#4137640 <https://reviews.llvm.org/D143349#4137640>, @Mordante wrote:

> I like @tahonermann's suggestion to test the edge cases.

That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:111-115
+      {6, 2, 3, 2}, // no space for third CP
+      {4, 3, 3, 2}, // incomplete third CP
+      {5, 3, 3, 2}, // incomplete third CP
+      {4, 2, 3, 2}, // incomplete third CP, and no space for it
+      {5, 2, 3, 2}, // incomplete third CP, and no space for it
----------------
Mordante wrote:
> dimztimz wrote:
> > Mordante wrote:
> > > I think this order of the test improves readability, same for the other tests.
> > > 
> > > Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small".
> > > 
> > > Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output.
> > I think this is subjective, you can give arguments for few different orderings.
> I agree it's subjective, it's just what feels easier for me.
> 
> I'm concerned that the test is not easy to understand, even with the comments.
> 
> I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding.
The problem lies in the specification for `std::codecvt` it is underspecified and hard to understand. Everyone will have the same hard time and there is no way around it. One has to reread the specs multiple times and after that the tests should be easier to read.

Maybe I can add here more comments, what do you think? Or you want me to change the order of the test cases?

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:175
+
+      // replace first trailing byte with invalid byte
+      {3, 4, 1, 1, 0xFF, 2},
----------------
Mordante wrote:
> dimztimz wrote:
> > Mordante wrote:
> > > What's the difference between an ASCII byte and an invalid byte?
> > > Both are just invalid due not having the bit pattern `10xxxxxx`, right?
> > Well in this test-case there is no difference. But in general, in UTF-8 string if your aim is to fully decode a string then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement char, or reported upwards in the call chain (or some combination of these). the ASCII byte breaks the original sequence but creates a new smaller valid sequence. To reach it, once you receive error, you can push your input pointer by one and do another call to `in()` to check if there is another valid sequence further in the string.
> Fair point. I think it would be good to mention the ASCII byte is a valid one code point code unit, since that is what actually matters. The test would give the same result when the code unit was the start of a multibyte code unit, right? (Except then the next code unit might be invalid again.)
I did not understand you here.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143349/new/

https://reviews.llvm.org/D143349