[libcxx-commits] [PATCH] D150031: [libc++][format] Adds a UTF transcoder.

Mon Jun 5 14:25:14 PDT 2023

tahonermann requested changes to this revision.
tahonermann added a comment.
This revision now requires changes to proceed.

The approach looks pretty good though I don't love the encoding inference based on `sizeof()`. I see the motivation though. I'll think about this some more, but I think we should just infer the encoding based on the character type; `char16_t` => UTF-16, `char32_t` => UTF-32, `wchar_t` => platform dependent.

================
Comment at: libcxx/include/print:44-48
+  requires(sizeof(iter_value_t<_OutIt>) == 2) &&
+          (output_iterator<_OutIt, const char16_t&>
+#    ifndef _LIBCPP_HAS_NO_WIDE_CHARACTERS
+           || output_iterator<_OutIt, const wchar_t&>
+#    endif
----------------
The underlying type of `char16_t` is `uint_least16_t`, so the `sizeof` constraint is technically incorrect here. Similarly, a size of 2 for `wchar_t` doesn't imply UTF-16 if `CHAR_BIT` is 16 (but that is only the case for a few embedded platforms).

We'll need to see how P2728R0 eventually shakes out, but I think any inference based on character type size should be avoided in favor of an approach that states the desired encoding; e.g., `__encode_utf16(...)`. Constraints could then be added to ensure that the element size is at least 16-bit (`sizeof(T) * CHAR_BIT`).

================
Comment at: libcxx/include/print:51
+_LIBCPP_HIDE_FROM_ABI constexpr void __encode(_OutIt& __out_it, char32_t __value) {
+  _LIBCPP_ASSERT(__is_scalar_value(__value), "an invalud unicode scalar value results in invalid UTF-16");
+
----------------

================
Comment at: libcxx/include/print:71
+_LIBCPP_HIDE_FROM_ABI constexpr void __encode(_OutIt& __out_it, char32_t __value) {
+  _LIBCPP_ASSERT(__is_scalar_value(__value), "an invalud unicode scalar value results in invalid UTF-32");
+  *__out_it++ = __value;
----------------

================
Comment at: libcxx/include/print:78
+_LIBCPP_HIDE_FROM_ABI constexpr _OutIt __transcode(_InIt __first, _InIt __last, _OutIt __out_it) {
+  // The __code_point_view has a basic_string_view interface.
+  // When transcoding becomes part of the standard we probably want to
----------------
`basic_string_view` or `span`? I would expect more like the latter.

================
Comment at: libcxx/include/print:81-83
+  // For example when processing a 1 or 2 code unit UTF-8 code point the
+  // result will always be a encoded to 1 UTF-16 code unit. (The
+  // surrogate range requires 3 code units.)
----------------

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150031/new/

https://reviews.llvm.org/D150031