[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.
Kai Nacke via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 14 13:05:00 PDT 2021
Kai added inline comments.
================
Comment at: llvm/include/llvm/Support/CharSet.h:47
+ /// UTF-8 character set encoding.
+ CS_UTF8,
+
----------------
hubert.reinterpretcast wrote:
> Just noting that https://wg21.link/p1885 proposes enumeration names (and values) for these.
Good point, thanks. It makes a lot of sense to have the names similar to proposed standards.
================
Comment at: llvm/lib/Support/CharSet.cpp:30-35
+#define CSNAME(CS, STR) \
+ if (CSName == STR) \
+ return CS
+ CSNAME(CharSetConverter::CS_UTF8, "UTF-8");
+ CSNAME(CharSetConverter::CS_LATIN1, "ISO8859-1");
+ CSNAME(CharSetConverter::CS_IBM1047, "IBM-1047");
----------------
hubert.reinterpretcast wrote:
> There is a normalization process for character set names described by https://wg21.link/p1885 (please refer to the source material directly as well: https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching).
>
> `ISO8859-1` is neither the primary name nor one of the aliases defined for this charset in the IANA character set registry.
Thanks! I added the normalization algorithm, as I think that it makes using the converter easier.
Obviously, the `ISO8859-1` name comes from the z/OS implementation of iconv, which accepts 'ISO8859-1` or `819` but not the registered name `ISO-8859-1`.
================
Comment at: llvm/lib/Support/CharSet.cpp:102
+ if (Ch >= 128) {
+ // Only two-byte sequences can be decoded.
+ if (Ch != 0xc2 && Ch != 0xc3)
----------------
hubert.reinterpretcast wrote:
> The comment should say, "only valid sequences encoding UCS scalar values in the range [U+0080, U+00FF] can be decoded".
Changed.
================
Comment at: llvm/lib/Support/CharSet.cpp:113
+ Ch = Ch2 | (Ch << 6);
+ Length--;
+ }
----------------
hubert.reinterpretcast wrote:
> The coding guidelines say to use preincrement in cases where postincrement is not needed.
Thanks for catching - my fault.
================
Comment at: llvm/lib/Support/CharSet.cpp:155
+ OutputLength = Capacity - Used;
+ } else
+ // Some other error occured.
----------------
hubert.reinterpretcast wrote:
> The coding guidelines have been updated to discourage mixing use and omission of braces for the constituent parts of an if/else chain.
Thanks again, I was not aware of this change.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D88741/new/
https://reviews.llvm.org/D88741
More information about the llvm-commits
mailing list