[all-commits] [llvm/llvm-project] 68c3d6: [libc++][format] Improves width estimate.

Thu Apr 20 12:18:57 PDT 2023

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 68c3d66a97a0c1b7bb1b6b2b84640eece07bcc9c
      https://github.com/llvm/llvm-project/commit/68c3d66a97a0c1b7bb1b6b2b84640eece07bcc9c
  Author: Mark de Wever <koraq at xs4all.nl>
  Date:   2023-04-20 (Thu, 20 Apr 2023)

  Changed paths:
    M libcxx/docs/ReleaseNotes.rst
    M libcxx/docs/Status/Cxx2bPapers.csv
    M libcxx/docs/Status/FormatIssues.csv
    M libcxx/docs/UsingLibcxx.rst
    M libcxx/include/CMakeLists.txt
    M libcxx/include/__format/parser_std_format_spec.h
    A libcxx/include/__format/width_estimation_table.h
    M libcxx/include/module.modulemap.in
    M libcxx/test/libcxx/private_headers.verify.cpp
    A libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp
    M libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp
    M libcxx/utils/CMakeLists.txt
    A libcxx/utils/data/unicode/EastAsianWidth.txt
    M libcxx/utils/data/unicode/README.txt
    A libcxx/utils/generate_width_estimation_table.py

  Log Message:
  -----------
  [libc++][format] Improves width estimate.

As obvious from the paper's title this is an LWG issue and thus retroactively
applied to C++20. This change may the output for certain code points:
1 Considers 8477 extra codepoints as having a width 2 (as of Unicode 15)
  (mostly Tangut Ideographs)
2 Change the width of 85 unassigned code points from 2 to 1
3 Change the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER
  TEN ON BLACK SQUARE ... U+324F CIRCLED NUMBER EIGHTY ON BLACK
  SQUARE) from 2 to 1, because it seems questionable to make an exception
  for those without input from Unicode

Note that libc++ already uses Unicode 15, while the Standard requires Unicode 12.
(The last time I checked MSVC STL used Unicode 14.)

So in practice the only notable change is item 3.

Implements
  P2675 LWG3780: The Paper
  format's width estimation is too approximate and not forward compatible

Benchmark before these changes
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3928 ns         3928 ns       178131
BM_unicode_text<char>          75231 ns        75230 ns         9158
BM_cyrillic_text<char>         59837 ns        59834 ns        11529
BM_japanese_text<char>         39842 ns        39832 ns        17501
BM_emoji_text<char>             3931 ns         3930 ns       177750
BM_ascii_text<wchar_t>          4024 ns         4024 ns       174190
BM_unicode_text<wchar_t>       63756 ns        63751 ns        11136
BM_cyrillic_text<wchar_t>      44639 ns        44638 ns        15597
BM_japanese_text<wchar_t>      34425 ns        34424 ns        20283
BM_emoji_text<wchar_t>          3937 ns         3937 ns       177684

Benchmark after these changes
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3914 ns         3913 ns       178814
BM_unicode_text<char>          70380 ns        70378 ns         9694
BM_cyrillic_text<char>         51889 ns        51877 ns        13488
BM_japanese_text<char>         41707 ns        41705 ns        16723
BM_emoji_text<char>             3908 ns         3907 ns       177912
BM_ascii_text<wchar_t>          3949 ns         3948 ns       177525
BM_unicode_text<wchar_t>       64591 ns        64587 ns        10649
BM_cyrillic_text<wchar_t>      44089 ns        44078 ns        15721
BM_japanese_text<wchar_t>      39369 ns        39367 ns        17779
BM_emoji_text<wchar_t>          3936 ns         3934 ns       177821

Benchmarks without "if(__code_point < (__entries[0] >> 14))"
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_ascii_text<char>             3922 ns         3922 ns       178587
BM_unicode_text<char>          94474 ns        94474 ns         7351
BM_cyrillic_text<char>         69202 ns        69200 ns        10157
BM_japanese_text<char>         42735 ns        42692 ns        16382
BM_emoji_text<char>             3920 ns         3919 ns       178704
BM_ascii_text<wchar_t>          3951 ns         3950 ns       177224
BM_unicode_text<wchar_t>       81003 ns        80988 ns         8668
BM_cyrillic_text<wchar_t>      57020 ns        57018 ns        12048
BM_japanese_text<wchar_t>      39695 ns        39687 ns        17582
BM_emoji_text<wchar_t>          3977 ns         3976 ns       176479

This optimization does carry its weight for the Unicode and Cyrillic
test. For the Japanese tests the gains are minor and for emoji it seems
to have no effect.

Reviewed By: ldionne, tahonermann, #libc

Differential Revision: https://reviews.llvm.org/D144499