[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.

Hubert Tong via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 26 20:04:52 PDT 2021


hubert.reinterpretcast added inline comments.


================
Comment at: llvm/lib/Support/CharSet.cpp:134
+        } else {
+          // The UCS scalar value is encoded with more than 2 bytes. In this
+          // case, the value cannot be represented in the 1 byte Latin1/EBCDIC
----------------
The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the `iconv` path to check the return code) would save effort on getting the error handling right.

It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater then 0x10FFFF or because it encodes a surrogate code point).


================
Comment at: llvm/lib/Support/CharSet.cpp:138
+          // the trailing byte sequence is skipped.
+          unsigned SkipLen = countLeadingOnes(Ch);
+          if (SkipLen > 4)
----------------
I suggest renaming this to `SeqLength` and changing the comment above to say "the rest of the sequence is skipped" (if indeed that aspect survives the changes to either implement error detection or abandon lossy translation).


================
Comment at: llvm/lib/Support/CharSet.cpp:139
+          unsigned SkipLen = countLeadingOnes(Ch);
+          if (SkipLen > 4)
+            return std::make_error_code(std::errc::illegal_byte_sequence);
----------------
An error should also be flagged when `SkipLen == 1`.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88741/new/

https://reviews.llvm.org/D88741



More information about the llvm-commits mailing list