[PATCH] D88741: [SystemZ/z/OS] Add utility class for char set conversion.

Wed Nov 3 12:53:37 PDT 2021

Kai added inline comments.

================
Comment at: llvm/include/llvm/Support/CharSet.h:98
+  /// \param[in] Source source string
+  /// \param[in,out] Result container for converted string
+  /// \return error code in case something went wrong
----------------
hubert.reinterpretcast wrote:
> The expectations around null termination and shift state should be documented. Note that https://wg21.link/p2029 recommends that numeric escapes do not trigger changes in shift state, so avoiding automatic transitions to the initial shift state at the end of the (potential mid-string) translation is probably the right thing to do. Using a null `StringRef` as the method to cause a return to the initial shift state would be consistent with the `iconv` interface.
I updated the comment (and also the implementation) that an empty string resets the shift state. However, this feels a bit clumsy, and leaks a lot of the details of the iconv interface. So I wonder if a new method, e.g. `resetState()` would not be the better approach.

================
Comment at: llvm/lib/Support/CharSet.cpp:134
+        } else {
+          // The UCS scalar value is encoded with more than 2 bytes. In this
+          // case, the value cannot be represented in the 1 byte Latin1/EBCDIC
----------------
hubert.reinterpretcast wrote:
> The comment should say that the encoded value is greater than 0xFF or the encoding is overlong. Overlong encodings are still an error. If the conversion facility has no use case for "lossy" translation, then reverting to the previous code for the UTF-8 to ISO-8859-1 decoding (and fixing the `iconv` path to check the return code) would save effort on getting the error handling right.
> 
> It is also an error if the encoded value ends up being not a UCS scalar value (either because it is greater then 0x10FFFF or because it encodes a surrogate code point).
Ah, I see your point. I reverted the error handling here, and check the return code from `iconv()` instead below. There is no real use case for it. I 

================
Comment at: llvm/lib/Support/CharSet.cpp:178
+  size_t Capacity = Result.capacity();
+  Result.resize(Capacity);
+  char *Output = static_cast<char *>(Result.data());
----------------
hubert.reinterpretcast wrote:
> My first comment regarding `resize_for_overwrite` was for this line.
Sorry, I noted this later, too.

================
Comment at: llvm/lib/Support/CharSet.cpp:212
+        [](StringRef Source, SmallVectorImpl<char> &Result) {
+          Result.append(Source.begin(), Source.end());
+          return std::error_code();
----------------
hubert.reinterpretcast wrote:
> This is consistent with `convertWithTable` but not `convertWithIconv` with respect to whether `Result` is being appended to as opposed to it having its contents replaced. All three should be consistent.
Thanks for catching this! All three converters should now behave in the same way, that is, replacing the content.

================
Comment at: llvm/lib/Support/CharSet.cpp:151
+      const size_t Used = Capacity - OutputLength;
+      Capacity *= 2;
+      Result.resize(Capacity);
----------------
hubert.reinterpretcast wrote:
> Missing overflow detection for when the most significant bit of `Capacity` is already set.
Changed the handling to detect overflows.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88741/new/

https://reviews.llvm.org/D88741