[libcxx-commits] [PATCH] D144346: [libc++][format] Improves Unicode decoders.

Tue Feb 21 08:24:15 PST 2023

Mordante created this revision.
Herald added a project: All.
Mordante updated this revision to Diff 498678.
Mordante added a comment.
Mordante updated this revision to Diff 498868.
Mordante published this revision for review.
Herald added a project: libc++.
Herald added a subscriber: libcxx-commits.
Herald added a reviewer: libc++.

GCC 11 fixes.

Mordante added a comment.

Polishing before putting this up for review.

During the implementation of P2286 a second Unicode decoder was added.
The original decoder was only used for width estimation. Replacing
an ill-formed Unicode sequence with the replacement character works
properly for that use case. For P2286 an ill-formed Unicode sequence
needs to be formatted as a sequence of code units. The exact wording in
the Standard was a bit unclear and there was an odd example in the WP. This
made it hard to use the same decoder. SG16 determined the odd example in
the WP was a bug and this has been fixed in the WP.
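
To illustrate what "formatted as a sequence of code units" means for the
escaped output, here is a minimal sketch. It assumes the \x{hex} notation with
lowercase digits and no leading zeros, which is my reading of the P2286
wording. The helper name escape_code_unit is hypothetical and is not part of
libc++.

```cpp
#include <string>

// Hypothetical helper for illustration only; not the libc++ implementation.
// Renders a single code unit of an ill-formed sequence as \x{hex}, using
// lowercase hex digits without leading zeros (an assumption of this sketch).
std::string escape_code_unit(unsigned char unit) {
  constexpr char digits[] = "0123456789abcdef";
  std::string out = "\\x{";
  if (unit >= 0x10)
    out += digits[unit >> 4];  // high nibble, omitted when it is zero
  out += digits[unit & 0x0F];  // low nibble
  out += '}';
  return out;
}
```

With this sketch the ill-formed input "\xc3\x28" would be written as \x{c3}
followed by the well-formed character '(', whereas the width estimation would
treat the 0xc3 code unit as U+FFFD.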

This made it possible to combine the two decoders. The P2286 decoder
kept track of the size of the ill-formed sequence. However, this was not
needed since the output algorithm already needs to keep track of the size
of both well-formed and ill-formed sequences. So this feature has been
removed.

The error status remains since it's needed for P2286; grapheme
clustering can ignore this value. (In general, grapheme clustering
only has specified behaviour for Unicode. When the string
is in a non-Unicode encoding there are no requirements. Ill-formed
Unicode is a non-Unicode encoding. Still, libc++ does a best-effort
estimation.)

The UTF-8 decoder accepted several ill-formed sequences:

- Values in the surrogate range U+D800..U+DFFF.
- Values encoded in more code units than required. For example, U+0020 can in theory be encoded using 1, 2, 3, or 4 code units, and all of these encodings were accepted. This is not allowed by the Unicode Standard.
- Values larger than U+10FFFF were not always rejected.
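
The three checks listed above can be sketched as follows. This is a
simplified illustration, not the code in the patch: the names decode_result
and decode_one are hypothetical, and consuming exactly one code unit on error
is an assumption of this sketch that may differ from what libc++ does.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch; not the libc++ internal type.
struct decode_result {
  char32_t code_point;   // U+FFFD when the input is ill-formed
  std::size_t consumed;  // number of code units consumed
  bool ill_formed;       // the error status mentioned in the description
};

constexpr bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Decodes one code point from the front of a non-empty UTF-8 input.
// On any error a single code unit is consumed (assumption of this sketch).
constexpr decode_result decode_one(std::string_view in) {
  constexpr char32_t replacement = char32_t{0xFFFD};
  const unsigned char lead = static_cast<unsigned char>(in[0]);
  if (lead < 0x80)
    return {lead, 1, false};  // ASCII

  std::size_t length = 0;
  char32_t value = 0;
  char32_t minimum = 0;  // smallest value that needs `length` code units
  if ((lead & 0xE0) == 0xC0) {
    length = 2; value = lead & 0x1F; minimum = 0x80;
  } else if ((lead & 0xF0) == 0xE0) {
    length = 3; value = lead & 0x0F; minimum = 0x800;
  } else if ((lead & 0xF8) == 0xF0) {
    length = 4; value = lead & 0x07; minimum = 0x10000;
  } else {
    return {replacement, 1, true};  // stray continuation or invalid lead
  }

  if (in.size() < length)
    return {replacement, 1, true};  // truncated sequence
  for (std::size_t i = 1; i < length; ++i) {
    const unsigned char c = static_cast<unsigned char>(in[i]);
    if (!is_continuation(c))
      return {replacement, 1, true};
    value = (value << 6) | (c & 0x3F);
  }

  if (value < minimum)
    return {replacement, 1, true};  // encoded in more code units than required
  if (value >= 0xD800 && value <= 0xDFFF)
    return {replacement, 1, true};  // surrogate range
  if (value > 0x10FFFF)
    return {replacement, 1, true};  // beyond the Unicode code space
  return {value, length, false};
}
```

For example, decode_one("\xC0\xA0") reports an ill-formed sequence because
U+0020 is encoded in two code units, while decode_one("\xE2\x82\xAC") yields
U+20AC after consuming three code units.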

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D144346

Files:
  libcxx/include/__format/formatter_output.h
  libcxx/include/__format/unicode.h
  libcxx/test/std/utilities/format/format.functions/escaped_output.unicode.pass.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D144346.498868.patch
Type: text/x-patch
Size: 24969 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/libcxx-commits/attachments/20230221/f46f9f45/attachment-0001.bin>