[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.
Hubert Tong via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 12 16:37:46 PDT 2021
hubert.reinterpretcast added inline comments.
================
Comment at: llvm/include/llvm/Support/CharSet.h:47
+ /// UTF-8 character set encoding.
+ CS_UTF8,
+
----------------
Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.
================
Comment at: llvm/lib/Support/CharSet.cpp:30-35
+#define CSNAME(CS, STR) \
+ if (CSName == STR) \
+ return CS
+ CSNAME(CharSetConverter::CS_UTF8, "UTF-8");
+ CSNAME(CharSetConverter::CS_LATIN1, "ISO8859-1");
+ CSNAME(CharSetConverter::CS_IBM1047, "IBM-1047");
----------------
There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).
`ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.
================
Comment at: llvm/lib/Support/CharSet.cpp:102
+ if (Ch >= 128) {
+ // Only two-byte sequences can be decoded.
+ if (Ch != 0xc2 && Ch != 0xc3)
----------------
The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".
================
Comment at: llvm/lib/Support/CharSet.cpp:104
+ if (Ch != 0xc2 && Ch != 0xc3)
+ return std::make_error_code(std::errc::illegal_byte_sequence);
+ // Is buffer truncated?
----------------
The API contract between the table conversion and the `iconv` conversion is inconsistent. `iconv` conversion performs an implementation-defined conversion for valid input characters that does not have a representation in the output codeset. This implementation fails the conversion instead.
================
Comment at: llvm/lib/Support/CharSet.cpp:113
+ Ch = Ch2 | (Ch << 6);
+ Length--;
+ }
----------------
The coding guidelines say to use preincrement in cases where postincrement is not needed.
================
Comment at: llvm/lib/Support/CharSet.cpp:140
+ size_t Capacity = Result.capacity();
+ Result.resize(Capacity);
+ char *Output = static_cast<char *>(Result.data());
----------------
This looks like a use case for `resize_for_overwrite`.
================
Comment at: llvm/lib/Support/CharSet.cpp:151
+ const size_t Used = Capacity - OutputLength;
+ Capacity *= 2;
+ Result.resize(Capacity);
----------------
Missing overflow detection for when the most significant bit of `Capacity` is already set.
================
Comment at: llvm/lib/Support/CharSet.cpp:152
+ Capacity *= 2;
+ Result.resize(Capacity);
+ Output = static_cast<char *>(Result.data()) + Used;
----------------
Same comment about `resize_for_overwrite`.
================
Comment at: llvm/lib/Support/CharSet.cpp:155
+ OutputLength = Capacity - Used;
+ } else
+ // Some other error occured.
----------------
The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.
================
Comment at: llvm/lib/Support/CharSet.cpp:194
+ iconv_t ConvDesc = iconv_open(CSTo.str().c_str(), CSFrom.str().c_str());
+ if (ConvDesc == reinterpret_cast<iconv_t>(-1))
+ return std::error_code(errno, std::generic_category());
----------------
Use a C-style cast if necessary, but `reinterpret_cast` is wrong here.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D88741/new/
https://reviews.llvm.org/D88741
More information about the llvm-commits
mailing list