[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.

Thu Oct 14 13:05:00 PDT 2021

Kai added inline comments.

================
Comment at: llvm/include/llvm/Support/CharSet.h:47
+    /// UTF-8 character set encoding.
+    CS_UTF8,
+
----------------
hubert.reinterpretcast wrote:
> Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.
Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.

================
Comment at: llvm/lib/Support/CharSet.cpp:30-35
+#define CSNAME(CS, STR)                                                        \
+  if (CSName == STR)                                                           \
+  return CS
+  CSNAME(CharSetConverter::CS_UTF8, "UTF-8");
+  CSNAME(CharSetConverter::CS_LATIN1, "ISO8859-1");
+  CSNAME(CharSetConverter::CS_IBM1047, "IBM-1047");
----------------
hubert.reinterpretcast wrote:
> There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).
> 
> `ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.
Thanks! I added the normalization algorithm, as I think that it makes using the converter easier.

Obviously, the `ISO8859-1` name comes from the z/OS implementation of iconv, which accepts 'ISO8859-1` or `819` but not the registered name `ISO-8859-1`. 

================
Comment at: llvm/lib/Support/CharSet.cpp:102
+      if (Ch >= 128) {
+        // Only two-byte sequences can be decoded.
+        if (Ch != 0xc2 && Ch != 0xc3)
----------------
hubert.reinterpretcast wrote:
> The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".
Changed.

================
Comment at: llvm/lib/Support/CharSet.cpp:113
+        Ch = Ch2 | (Ch << 6);
+        Length--;
+      }
----------------
hubert.reinterpretcast wrote:
> The coding guidelines say to use preincrement in cases where postincrement is not needed.
Thanks for catching - my fault.

================
Comment at: llvm/lib/Support/CharSet.cpp:155
+      OutputLength = Capacity - Used;
+    } else
+      // Some other error occured.
----------------
hubert.reinterpretcast wrote:
> The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.
Thanks again, I was not aware of this change.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88741/new/

https://reviews.llvm.org/D88741