[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.

Tue Oct 12 16:37:46 PDT 2021

hubert.reinterpretcast added inline comments.

================
Comment at: llvm/include/llvm/Support/CharSet.h:47
+    /// UTF-8 character set encoding.
+    CS_UTF8,
+
----------------
Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.

================
Comment at: llvm/lib/Support/CharSet.cpp:30-35
+#define CSNAME(CS, STR)                                                        \
+  if (CSName == STR)                                                           \
+  return CS
+  CSNAME(CharSetConverter::CS_UTF8, "UTF-8");
+  CSNAME(CharSetConverter::CS_LATIN1, "ISO8859-1");
+  CSNAME(CharSetConverter::CS_IBM1047, "IBM-1047");
----------------
There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).

`ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.

================
Comment at: llvm/lib/Support/CharSet.cpp:102
+      if (Ch >= 128) {
+        // Only two-byte sequences can be decoded.
+        if (Ch != 0xc2 && Ch != 0xc3)
----------------
The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".

================
Comment at: llvm/lib/Support/CharSet.cpp:104
+        if (Ch != 0xc2 && Ch != 0xc3)
+          return std::make_error_code(std::errc::illegal_byte_sequence);
+        // Is buffer truncated?
----------------
The API contract between the table conversion and the `iconv` conversion is inconsistent. `iconv` conversion performs an implementation-defined conversion for valid input characters that does not have a representation in the output codeset. This implementation fails the conversion instead.

================
Comment at: llvm/lib/Support/CharSet.cpp:113
+        Ch = Ch2 | (Ch << 6);
+        Length--;
+      }
----------------
The coding guidelines say to use preincrement in cases where postincrement is not needed.

================
Comment at: llvm/lib/Support/CharSet.cpp:140
+  size_t Capacity = Result.capacity();
+  Result.resize(Capacity);
+  char *Output = static_cast<char *>(Result.data());
----------------
This looks like a use case for `resize_for_overwrite`.

================
Comment at: llvm/lib/Support/CharSet.cpp:151
+      const size_t Used = Capacity - OutputLength;
+      Capacity *= 2;
+      Result.resize(Capacity);
----------------
Missing overflow detection for when the most significant bit of `Capacity` is already set.

================
Comment at: llvm/lib/Support/CharSet.cpp:152
+      Capacity *= 2;
+      Result.resize(Capacity);
+      Output = static_cast<char *>(Result.data()) + Used;
----------------
Same comment about `resize_for_overwrite`.

================
Comment at: llvm/lib/Support/CharSet.cpp:155
+      OutputLength = Capacity - Used;
+    } else
+      // Some other error occured.
----------------
The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.

================
Comment at: llvm/lib/Support/CharSet.cpp:194
+  iconv_t ConvDesc = iconv_open(CSTo.str().c_str(), CSFrom.str().c_str());
+  if (ConvDesc == reinterpret_cast<iconv_t>(-1))
+    return std::error_code(errno, std::generic_category());
----------------
Use a C-style cast if necessary, but `reinterpret_cast` is wrong here.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88741/new/

https://reviews.llvm.org/D88741