[all-commits] [llvm/llvm-project] c86685: [libc++][format] Improves Unicode decoders.

Mark de Wever via All-commits all-commits at lists.llvm.org
Wed Mar 8 13:02:03 PST 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: c866855b42eb3e8aa7578aadb26e4431d1d71efd
      https://github.com/llvm/llvm-project/commit/c866855b42eb3e8aa7578aadb26e4431d1d71efd
  Author: Mark de Wever <koraq at xs4all.nl>
  Date:   2023-03-08 (Wed, 08 Mar 2023)

  Changed paths:
    M libcxx/include/__format/formatter_output.h
    M libcxx/include/__format/unicode.h
    M libcxx/test/std/utilities/format/format.functions/escaped_output.unicode.pass.cpp

  Log Message:
  -----------
  [libc++][format] Improves Unicode decoders.

During the implementation of P2286 a second Unicode decoder was added.
The original decoder was only used for width estimation. Replacing an
ill-formed Unicode sequence with the replacement character works
properly for that use case. For P2286 an ill-formed Unicode sequence
needs to be formatted as a sequence of code units. The exact wording in
the Standard was a bit unclear and there was an odd example in the WP.
This made it hard to use the same decoder. SG16 determined the odd
example in the WP was a bug and this has been fixed in the WP.
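
To illustrate the two use cases described above, here is a minimal
sketch of how one decoder result could serve both consumers. The type
Decoded and both function names are hypothetical and are not the names
used in libc++; the \x{...} form follows the escaped-output format for
ill-formed code units.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical decoder output; libc++'s internal types differ.
struct Decoded {
    char32_t code_point;          // decoded value, or U+FFFD on error
    bool ill_formed;              // error status reported by the decoder
    std::string_view code_units;  // the code units that were consumed
};

// Width estimation: an ill-formed sequence is treated as U+FFFD.
inline char32_t for_width_estimation(const Decoded& d) {
    return d.ill_formed ? U'\uFFFD' : d.code_point;
}

// Escaped output: an ill-formed sequence is written one code unit at a
// time, each as \x{hex}.
inline std::string escape_ill_formed(const Decoded& d) {
    std::string out;
    for (unsigned char unit : d.code_units) {
        char buf[8];  // "\x{ff}" is 6 characters plus the terminator
        std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(unit));
        out += buf;
    }
    return out;
}
```

With a lone lead byte 0xC3 as input, the first function yields U+FFFD
and the second yields the text \x{c3}.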

This made it possible to combine the two decoders. The P2286 decoder
kept track of the size of the ill-formed sequence. However, this was not
needed since the output algorithm needs to keep track of the size of
both well-formed and ill-formed sequences. So this feature has been
removed.

The error status remains since it's needed for P2286; the grapheme
clustering can ignore this value. (In general, grapheme clustering only
has specified behaviour for Unicode. When the string is in a
non-Unicode encoding there are no requirements. Ill-formed Unicode is a
non-Unicode encoding. Still, libc++ does a best-effort estimation.)

The UTF-8 decoder accepted several ill-formed sequences:
- Values in the surrogate range U+D800..U+DFFF.
- Values encoded in more code units than required. For example, U+0020
  can in theory be encoded using 1, 2, 3, or 4 code units, and all of
  these were accepted. This is not allowed by the Unicode Standard.
- Values larger than U+10FFFF were not always rejected.
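
The three checks listed above can be sketched as follows. This is an
illustrative stand-alone decoder, not the code in
libcxx/include/__format/unicode.h; DecodeResult, decode_one, and the
policy of consuming a single code unit on error are assumptions made
for this example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch.
struct DecodeResult {
    char32_t code_point;   // U+FFFD when the sequence is ill-formed
    bool ill_formed;       // error status
    std::size_t consumed;  // number of code units consumed
};

constexpr char32_t kReplacement = U'\uFFFD';

// Decodes one code point from the front of `in`, which must be non-empty.
inline DecodeResult decode_one(std::string_view in) {
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(in[i]); };
    auto is_continuation = [&](std::size_t i) {
        return i < in.size() && (byte(i) & 0xC0) == 0x80;
    };

    const unsigned char b0 = byte(0);
    if (b0 < 0x80) return {b0, false, 1};  // ASCII, one code unit

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min_cp = 0;  // smallest value that needs `len` code units
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return {kReplacement, true, 1};  // stray continuation or invalid lead

    for (std::size_t i = 1; i < len; ++i) {
        if (!is_continuation(i)) return {kReplacement, true, 1};
        cp = (cp << 6) | (byte(i) & 0x3F);
    }

    // Encoded in more code units than required (overlong).
    if (cp < min_cp) return {kReplacement, true, 1};
    // Surrogate range U+D800..U+DFFF.
    if (cp >= 0xD800 && cp <= 0xDFFF) return {kReplacement, true, 1};
    // Larger than U+10FFFF.
    if (cp > 0x10FFFF) return {kReplacement, true, 1};

    return {cp, false, len};
}
```

For example, the two-unit sequence C0 A0 decodes to the value 0x20,
which is below the two-unit minimum of 0x80, so it is reported as
ill-formed rather than as U+0020.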

Reviewed By: #libc, ldionne, tahonermann, Mordante

Differential Revision: https://reviews.llvm.org/D144346



