[libc-commits] [libc] [libc] utf8 to 32 CharacterConverter (PR #143973)

Brooks Moses via libc-commits libc-commits at lists.llvm.org
Fri Jun 13 15:51:23 PDT 2025


================
@@ -22,13 +25,65 @@ bool CharacterConverter::isComplete() {
   return state->bytes_processed == state->total_bytes;
 }
 
-int CharacterConverter::push(char8_t utf8_byte) {}
+int CharacterConverter::push(char8_t utf8_byte) {
+  // Checking the first byte if first push
+  if (state->bytes_processed == 0 && state->total_bytes == 0) {
+    state->partial = static_cast<char32_t>(0);
+    uint8_t numOnes = static_cast<uint8_t>(cpp::countl_one(utf8_byte));
+    // 1 byte total
+    if (numOnes == 0) {
+      state->total_bytes = 1;
+    }
+    // 2 through 4 bytes total
+    else if (numOnes >= 2 && numOnes <= 4) {
+      /* Since the format is 110xxxxx, 1110xxxx, and 11110xxx for 2, 3, and 4,
+      we will make the base mask with 7 ones and right shift it as necessary. */
+      constexpr size_t significant_bits = 7;
+      state->total_bytes = numOnes;
+      utf8_byte &=
+          (mask_trailing_ones<uint32_t, significant_bits>() >> numOnes);
+    }
+    // Invalid first byte
+    else {
+      return -1;
+    }
+    state->partial = static_cast<char32_t>(utf8_byte);
+    state->bytes_processed++;
+    return 0;
+  }
+  // Any subsequent push
+  // Adding 6 more bits so need to left shift
+  constexpr size_t ENCODED_BITS_PER_UTF8 = 6;
+  if (cpp::countl_one(utf8_byte) == 1 && !isComplete()) {
+    char32_t byte =
+        utf8_byte & mask_trailing_ones<uint32_t, ENCODED_BITS_PER_UTF8>();
+    state->partial = state->partial << ENCODED_BITS_PER_UTF8;
+    state->partial |= byte;
+    state->bytes_processed++;
+    return 0;
+  }
+  // Invalid byte -> reset if we didn't get successful complete read
+  if (!isComplete()) {
----------------
brooksmoses wrote:

I'm a little confused by the logic here.  Why are we only resetting this if we try to push an invalid byte into an incomplete character.  Shouldn't an attempt to push a byte into a complete state also cause the state to be reset?

With the current arrangement, it's possible to get the CharacterConverter into a state where it is erroring on pushes and the only way to get it to stop erroring is to call "pop" on it.  That seems very counterintuitive, because "the push function returned -1" should mean "calling pop is invalid".

https://github.com/llvm/llvm-project/pull/143973


More information about the libc-commits mailing list