[PATCH] D106753: ConvertUTF: Created wrapper convertUTF32ToUTF8String
Eli Friedman via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Fri Jul 30 14:20:55 PDT 2021
efriedma added inline comments.
================
Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:176
+ // enough that we can fit a null terminator without reallocating.
+ Out.resize(SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1);
+ UTF8 *Dst = reinterpret_cast<UTF8 *>(&Out[0]);
----------------
MarcusJohnson91 wrote:
> efriedma wrote:
> > `SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` seems like way too much memory.
> I copied that from the UTF16 code
I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.
================
Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:168
+ std::vector<UTF32> ByteSwapped;
+ if (Src[0] == UNI_UTF16_BYTE_ORDER_MARK_SWAPPED) {
+ ByteSwapped.insert(ByteSwapped.end(), Src, SrcEnd);
----------------
efriedma wrote:
> MarcusJohnson91 wrote:
> > efriedma wrote:
> > > Wrong constant.
> > >
> > > Is this really the function you want to be using from clang? I don't really understand why you'd want to handle byte order marks.
> > I don't really care about the BOM tbh, I just figured if I was in here, I should flesh out the UTF-32 interface.
> The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap `L"\uFFFE%s"` or something like that.
Any thoughts on this?
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D106753/new/
https://reviews.llvm.org/D106753
More information about the cfe-commits
mailing list