[libcxx-commits] [PATCH] D143349: [libc++] Fix UTF-8 decoding in codecvts. Fix #60177.

Wed Mar 1 11:23:15 PST 2023

dimztimz marked 3 inline comments as done.
dimztimz added inline comments.

================
Comment at: libcxx/test/std/localization/codecvt_unicode.pass.cpp:175
+
+      // replace first trailing byte with invalid byte
+      {3, 4, 1, 1, 0xFF, 2},
----------------
dimztimz wrote:
> Mordante wrote:
> > dimztimz wrote:
> > > Mordante wrote:
> > > > What's the difference between an ASCII byte and an invalid byte?
> > > > Both are just invalid due to not having the bit pattern `10xxxxxx`, right?
> > > Well, in this test case there is no difference. But in general, if your aim is to fully decode a UTF-8 string, then all valid sequences must be treated as valid, and any erroneous bytes between them should be skipped, replaced with a replacement character, or reported up the call chain (or some combination of these). The ASCII byte breaks the original sequence but starts a new, shorter valid sequence. To reach it, once you receive an error, you can advance your input pointer by one and call `in()` again to check whether there is another valid sequence further in the string.
> > Fair point. I think it would be good to mention that the ASCII byte is a valid code point encoded in a single code unit, since that is what actually matters. The test would give the same result if the code unit were the start of a multibyte sequence, right? (Except then the next code unit might be invalid again.)
> I did not understand you here.
I added more comments here in my latest patch, and I think they explain the situation much better.
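
To illustrate the recovery strategy discussed above (on error, advance the input pointer by one and call `in()` again), here is a minimal sketch. It is not code from the patch: `Utf8ToUtf32` and `decode_lossy` are hypothetical names, and substituting U+FFFD for each skipped byte is just one of the options mentioned. It also assumes the library reports the error with `from_next` pointing at the start of the offending sequence, which is what this patch aims to ensure.

```cpp
#include <cwchar>  // std::mbstate_t
#include <locale>  // std::codecvt
#include <string>

// std::codecvt has a protected destructor; this thin wrapper (hypothetical
// name) lets us create the facet as a local object.
struct Utf8ToUtf32 : std::codecvt<char32_t, char, std::mbstate_t> {
  ~Utf8ToUtf32() {}
};

// Hypothetical helper: decodes UTF-8 to UTF-32. Every byte that cannot start
// or continue a valid sequence is replaced with U+FFFD, and decoding resumes
// at the following byte.
std::u32string decode_lossy(const std::string& input) {
  Utf8ToUtf32 cvt;
  // Each input byte yields at most one code point, so this is large enough.
  std::u32string out(input.size(), U'\0');

  const char* from = input.data();
  const char* from_end = from + input.size();
  char32_t* to = &out[0];
  char32_t* to_end = to + out.size();

  while (from != from_end) {
    std::mbstate_t state{};
    const char* from_next = from;
    char32_t* to_next = to;
    std::codecvt_base::result r =
        cvt.in(state, from, from_end, from_next, to, to_end, to_next);
    from = from_next;  // keep whatever was decoded successfully
    to = to_next;
    if (r == std::codecvt_base::ok)
      break;  // everything was consumed

    *to++ = char32_t(0xFFFD);  // mark the bad or truncated sequence
    if (r == std::codecvt_base::partial)
      break;  // truncated sequence at the end of the input
    ++from;   // error: skip one byte and try again from the next one
  }

  out.resize(static_cast<std::size_t>(to - out.data()));
  return out;
}
```

For example, with the input bytes `0xD1 'x'` (a lead byte followed by an ASCII byte instead of a trailing byte), the first call reports an error at `0xD1`, and the retry one byte later decodes `'x'` as a valid one-byte sequence.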

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143349/new/

https://reviews.llvm.org/D143349