[libcxx-commits] [PATCH] D103413: [libc++][format] Implement Unicode support.

Fri Jun 18 07:56:31 PDT 2021

vitaut requested changes to this revision.
vitaut added inline comments.
This revision now requires changes to proceed.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:669
+ * Unicode can be stored in several formats UTF-8, UTF-16, and UTF-32.
+ * Depending on format the relation between the number of characters stored and
+ * the number of output columns differs. The first relation is the number of
----------------
I think you mean code units and not characters. Same throughout the diff.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:679
+ *   characters.
+ * - UTF-16: The number of character is always one.
+ *
----------------
16 ->32

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:687-688
+ * - The simple scanner @ref __estimate_column_width_fast. This scanner assumes
+ *   1 character is 1 column. This scanner stops when it can't be sure the
+ *   assumption is valid:
+ *   - UTF-8 when the code point is encoded in more than 1 character.
----------------
I think a more robust error handling would be to count each invalid code unit as contributing 1 to the width. Terminals normally use replacement characters so it will give the correct result.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:744-767
+  return 1 + (__c >= 0x1100 &&
+              (__c <= 0x115f ||
+               (__c >= 0x2100 &&
+                (__c <= 0x232a ||
+                 (__c >= 0x2e80 &&
+                  (__c <= 0x303e ||
+                   (__c >= 0x3040 &&
----------------
A sequence of ifs would likely be more readable.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:828-830
+ * @note The scanner will stop when it encounters malformed Unicode. Since our
+ * scanner isn't a full blown Unicode parser it stops instead of throwing an
+ * exception.
----------------
Not sure what you mean by "a full blown Unicode parser" but I don't think it's ever a good idea to throw exception on width estimation.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:832
+ *
+ * @tparam __truncate When the last parsed code point is a multi-column output
+ *                    it may exceed the @a __threshold by one. The behaviour
----------------
Why not set `__threshold` to the maximum value to disable it instead of introducing another flag?

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:851
+ *
+ * @returns A pair:
+ * - Number of columns.
----------------
I recommend returning a struct with meaningful field names.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:994
+
+  _LIBCPP_ASSERT(__width != 0 || __precision != __format::__number_max,
+                 "The function has no effect and shouldn't be used");
----------------
What does the check `__precision != __format::__number_max` do here?

Note that `__number_max` is a valid precision.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103413/new/

https://reviews.llvm.org/D103413