[PATCH] D114342: ConvertUTF, new wrapper API

Marcus Johnson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 21 15:09:31 PDT 2022


MarcusJohnson91 added inline comments.


================
Comment at: llvm/lib/Support/ConvertUTFWrapper.cpp:172
+  // enough that we can fit a null terminator without reallocating.
+  Out.resize(SrcBytes.size() + 1);
+  UTF8 *Dst = reinterpret_cast<UTF8 *>(&Out[0]);
----------------
cor3ntin wrote:
> Bigcheese wrote:
> > This is technically correct, but it's implicit in that the max number of UTF8 code units per code point is the same as `sizeof(UTF32)`. Would be nice to have a comment.
> Nit: The comment still doesn't say that we assume there can only be 4 bytes (UTF-8 code units) per code point, which would not be the case if the UTF-8 comes from non-ISO 10646-conforming Android environments, for example.
I was confused by this: Unicode limits code points to U+10FFFF, so the maximum number of UTF-8 code units per code point is 4.

I mean, I can still put the comment in, but it seems pointless?

https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#I1.36559

This limit of 0x10FFFF has been in place since the year 2000:

https://www.unicode.org/L2/L2000/00079-n2175.htm
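
For reference, here is a rough sketch of the invariant under discussion. It assumes `SrcBytes` holds UTF-32 input, as the mention of `sizeof(UTF32)` above suggests. `utf8CodeUnits`, `utf8BytesFor` and `MaxCodePoint` are hypothetical names made up for illustration; they are not part of ConvertUTF, and the sketch does not reject surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ConvertUTF): number of UTF-8 code units
// needed to encode a single code point. Does not validate surrogates.
constexpr std::size_t utf8CodeUnits(std::uint32_t CodePoint) {
  return CodePoint < 0x80      ? 1
         : CodePoint < 0x800   ? 2
         : CodePoint < 0x10000 ? 3
                               : 4;
}

// Largest code point Unicode allows.
constexpr std::uint32_t MaxCodePoint = 0x10FFFF;

// The largest code point needs 4 code units, which happens to equal the
// size in bytes of one UTF-32 code unit.
static_assert(utf8CodeUnits(MaxCodePoint) == 4,
              "U+10FFFF encodes to 4 UTF-8 code units");
static_assert(utf8CodeUnits(MaxCodePoint) <= sizeof(char32_t),
              "UTF-8 output never exceeds the UTF-32 input size in bytes");

// Hypothetical helper: exact number of UTF-8 bytes needed for a UTF-32
// string, assuming every element is <= MaxCodePoint.
std::size_t utf8BytesFor(const std::u32string &Src) {
  std::size_t Total = 0;
  for (char32_t C : Src) {
    assert(C <= MaxCodePoint && "not a Unicode code point");
    Total += utf8CodeUnits(C);
  }
  return Total;
}
```

Under those assumptions, `utf8BytesFor(S)` can never exceed `S.size() * sizeof(char32_t)`, so a buffer of `SrcBytes.size() + 1` bytes leaves room for the converted text plus a null terminator. Input containing values above 0x10FFFF would break that bound, which I take to be the case the nit is about.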


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114342/new/

https://reviews.llvm.org/D114342


