[PATCH] D107202: ConvertUTF: convertUTF32ToUTF8String

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 3 11:01:28 PDT 2021


efriedma added a comment.

In D107202#2921107 <https://reviews.llvm.org/D107202#2921107>, @MarcusJohnson91 wrote:

> What BOM handling? There is no BOM function; bytes are swapped in the converter if the byte order isn't correct. Is that what you mean?

I mean the behavior when handling strings that begin with UNI_UTF32_BYTE_ORDER_MARK_SWAPPED.

I suspect a lot of callers don't want the BOM handling to trigger. This includes trying to print diagnostics for wprintf, since wprintf itself doesn't have any BOM handling. But I guess it's unlikely to matter in practice.
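
To make the concern concrete, here is a minimal sketch of the kind of BOM handling being discussed. It is not the code from the patch: `normalizeUTF32`, `byteSwap32`, `BOMNative`, and `BOMSwapped` are hypothetical names, and the two constants are assumed to match the values of UNI_UTF32_BYTE_ORDER_MARK_NATIVE and UNI_UTF32_BYTE_ORDER_MARK_SWAPPED in ConvertUTF.h.

```cpp
#include <cstdint>
#include <vector>

// Assumed to match the UTF-32 BOM constants in ConvertUTF.h.
constexpr std::uint32_t BOMNative = 0x0000FEFF;
constexpr std::uint32_t BOMSwapped = 0xFFFE0000;

// Reverse the byte order of one 32-bit code unit.
inline std::uint32_t byteSwap32(std::uint32_t V) {
  return (V >> 24) | ((V >> 8) & 0x0000FF00u) |
         ((V << 8) & 0x00FF0000u) | (V << 24);
}

// Hypothetical helper: if the input starts with a byte-swapped BOM, swap
// every code unit; then drop a leading native BOM if one is present.
inline std::vector<std::uint32_t>
normalizeUTF32(std::vector<std::uint32_t> Units) {
  if (Units.empty())
    return Units;
  if (Units.front() == BOMSwapped)
    for (std::uint32_t &U : Units)
      U = byteSwap32(U);
  if (Units.front() == BOMNative)
    Units.erase(Units.begin());
  return Units;
}
```

A caller that passes a string whose first code unit happens to be 0xFFFE0000 gets every unit swapped, which is the behavior a wprintf-style consumer would not apply.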

In D107202#2921107 <https://reviews.llvm.org/D107202#2921107>, @MarcusJohnson91 wrote:

> I copied `SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` from the UTF-16 version.
>
> Are you asking me to change the UTF-16 version too?



In D106753#inline-1020607 <https://reviews.llvm.org/D106753#inline-1020607>, @efriedma wrote:

> I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: every code point takes exactly 4 bytes in UTF-32 and at most 4 bytes in UTF-8, so a UTF-32 string is never shorter than its translation to UTF-8.
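
A rough sketch of the size arithmetic, for illustration only. The helper names (`utf8Length`, `maxUTF8BytesFromUTF32`, `maxUTF8BytesFromUTF16`) are hypothetical and not part of ConvertUTF; the bounds assume well-formed input and do not count a trailing NUL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs to encode one Unicode code point.
inline std::size_t utf8Length(std::uint32_t CodePoint) {
  assert(CodePoint <= 0x10FFFF && "not a Unicode code point");
  if (CodePoint < 0x80)
    return 1;
  if (CodePoint < 0x800)
    return 2;
  if (CodePoint < 0x10000)
    return 3;
  return 4;
}

// UTF-32: each 4-byte code unit becomes at most 4 UTF-8 bytes, so the
// output can never be larger than the input.
inline std::size_t maxUTF8BytesFromUTF32(std::size_t SrcBytes) {
  return SrcBytes;
}

// UTF-16: a single 2-byte code unit can become 3 UTF-8 bytes, while a
// surrogate pair (4 bytes) becomes 4, so the worst case is 3 bytes per
// 2-byte code unit.
inline std::size_t maxUTF8BytesFromUTF16(std::size_t SrcBytes) {
  return (SrcBytes / 2) * 3;
}
```

For example, U+20AC is one 2-byte code unit in UTF-16 and three bytes in UTF-8, whereas in UTF-32 it already occupies four bytes, so the UTF-8 form is shorter.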


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107202/new/

https://reviews.llvm.org/D107202


