[libcxx-commits] [PATCH] D134036: [libc++][format] Implements string escaping.

Tom Honermann via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Mon Oct 3 09:50:58 PDT 2022


tahonermann added a comment.

Overall, this looks really good to me. I think I spotted an issue with handling of ill-formed UTF-16 though. The code also appears to assume that characters in the ASCII range have the same value for all `CharT` types. In general, that is not true. For Clang specifically that would not be true when targeting an EBCDIC-based platform (unless `wchar_t` is for a wide-EBCDIC variant in that case? But even then, there would be mismatches for `char8_t`, `char16_t` and `char32_t` though specification for those types is currently lacking).



================
Comment at: libcxx/include/__format/formatter_output.h:446
+  switch (__value) {
+  case _CharT('\t'):
+    __str += _LIBCPP_STATICALLY_WIDEN(_CharT, "\\t");
----------------
I think these case statements should use `_LIBCPP_STATICALLY_WIDEN` or similar as there is no guarantee that the `wchar_t` value for `\t` is the same as for `char`.


================
Comment at: libcxx/include/__format/formatter_output.h:517-519
+        _LIBCPP_ASSERT(__result.__ill_formed_size == 1, "for UTF-16 at most two code units per code point.");
+        for (int __i = 0; __i < __result.__ill_formed_size; __result.__value >>= 16, (void)++__i)
+          __formatter::__write_escape_ill_formed_code_unit(__str, __result.__value & 0xffff);
----------------
`__consume_p2286()` has a condition where a `__consume_p2286_result` value is returned with `__ill_formed_size` equal to 2 that would appear to violate the assertion here. The assertion of `1` is also at odds with the comment and the need for the following loop. I suspect this is intended to assert a value of either `1` or `2`. If so, this implies a missing test case.


================
Comment at: libcxx/include/__format/unicode.h:271-274
+          // Not a valid low surrogate pair
+          // store in invalid code units in reverse order
+          __result |= static_cast<char32_t>(*__first_) << 16;
+          return {2, __result};
----------------
If the second code unit is not in the low surrogate range, than it is necessarily in the BMP or high surrogate range in which case, it may begin the next character. This should probably be handled as a single ill-formed code unit (in which case, there is interaction with some of my other comments).


================
Comment at: libcxx/include/__format/unicode.h:274
+          __result |= static_cast<char32_t>(*__first_) << 16;
+          return {2, __result};
+        }
----------------
There is an assertion of `__ill_formed_size == 1` in `__escape()` for the `sizeof(CharT)==2` case that would appear to be violated here.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134036/new/

https://reviews.llvm.org/D134036



More information about the libcxx-commits mailing list