[libcxx-commits] [PATCH] D103413: [libc++][format] Implement Unicode support.

Sun Jun 27 08:32:47 PDT 2021

Mordante marked 6 inline comments as done.
Mordante added inline comments.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:669
+ * Unicode can be stored in several formats UTF-8, UTF-16, and UTF-32.
+ * Depending on format the relation between the number of characters stored and
+ * the number of output columns differs. The first relation is the number of
----------------
vitaut wrote:
> I think you mean code units and not characters. Same throughout the diff.
Correct.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:687-688
+ * - The simple scanner @ref __estimate_column_width_fast. This scanner assumes
+ *   1 character is 1 column. This scanner stops when it can't be sure the
+ *   assumption is valid:
+ *   - UTF-8 when the code point is encoded in more than 1 character.
----------------
vitaut wrote:
> I think a more robust error handling would be to count each invalid code unit as contributing 1 to the width. Terminals normally use replacement characters so it will give the correct result.
Not entirely sure what you mean. This algorithm is fast, but simple. The engine first uses this algorithm, when it "fails" it uses a more advanced algorithm. The more advanced algorithm implements the estimation as defined in the Standard.
Basically this algorithm's main purpose is to handle ASCII, when it's not ASCII the Standard's estimation is used.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:744-767
+  return 1 + (__c >= 0x1100 &&
+              (__c <= 0x115f ||
+               (__c >= 0x2100 &&
+                (__c <= 0x232a ||
+                 (__c >= 0x2e80 &&
+                  (__c <= 0x303e ||
+                   (__c >= 0x3040 &&
----------------
vitaut wrote:
> A sequence of ifs would likely be more readable.
Agreed. I originally had a sequence of `if`s, but that was a bit slow. I used this to speed things up. However I'll look whether I can use `if`s in a smarter fashion to improve readability.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:828-830
+ * @note The scanner will stop when it encounters malformed Unicode. Since our
+ * scanner isn't a full blown Unicode parser it stops instead of throwing an
+ * exception.
----------------
vitaut wrote:
> Not sure what you mean by "a full blown Unicode parser" but I don't think it's ever a good idea to throw exception on width estimation.
When I think about a "a full blown Unicode parser" I think of something like the ICU library.
I think here it makes sense to use your previous suggestion:
"I think a more robust error handling would be to count each invalid code unit as contributing 1 to the width. "

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:832
+ *
+ * @tparam __truncate When the last parsed code point is a multi-column output
+ *                    it may exceed the @a __threshold by one. The behaviour
----------------
vitaut wrote:
> Why not set `__threshold` to the maximum value to disable it instead of introducing another flag?
That can give the issue, where the result can be off by one. I'll see whether I can find a way to improve this code.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:994
+
+  _LIBCPP_ASSERT(__width != 0 || __precision != __format::__number_max,
+                 "The function has no effect and shouldn't be used");
----------------
vitaut wrote:
> What does the check `__precision != __format::__number_max` do here?
> 
> Note that `__number_max` is a valid precision.
This was based on my initial assumption 999.999.999 would be more than anybody would use for a width or precision. Since that assumption's wrong the test here needs to be adjusted.

================
Comment at: libcxx/include/__format/parser_std_format_spec.h:997
+  _LIBCPP_ASSERT(__width <= __precision,
+                 "The caller should validate this invariant");
+
----------------
As pointed out in D103425 the width can be larger than the precision. So this test needs to be removed.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103413/new/

https://reviews.llvm.org/D103413