[PATCH] D106753: ConvertUTF: Created wrapper convertUTF32ToUTF8String

Fri Jul 30 14:20:55 PDT 2021

efriedma added inline comments.

================
Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:176
+  // enough that we can fit a null terminator without reallocating.
+  Out.resize(SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1);
+  UTF8 *Dst = reinterpret_cast<UTF8 *>(&Out[0]);
----------------
MarcusJohnson91 wrote:
> efriedma wrote:
> > `SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` seems like way too much memory.
> I copied that from the UTF16 code
I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16.  A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8.  That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8.  A codepoint in UTF-8 is at most 4 bytes.

================
Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:168
+  std::vector<UTF32> ByteSwapped;
+  if (Src[0] == UNI_UTF16_BYTE_ORDER_MARK_SWAPPED) {
+    ByteSwapped.insert(ByteSwapped.end(), Src, SrcEnd);
----------------
efriedma wrote:
> MarcusJohnson91 wrote:
> > efriedma wrote:
> > > Wrong constant.
> > > 
> > > Is this really the function you want to be using from clang?  I don't really understand why you'd want to handle byte order marks.
> > I don't really care about the BOM tbh, I just figured if I was in here, I should flesh out the UTF-32 interface.
> The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings.  We don't want to byteswap `L"\uFFFE%s"` or something like that.
Any thoughts on this?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106753/new/

https://reviews.llvm.org/D106753