[libc-commits] [libc] [libc] utf8 to 32 CharacterConverter (PR #143973)

Michael Jones via libc-commits libc-commits at lists.llvm.org
Fri Jun 13 10:16:25 PDT 2025


================
@@ -22,13 +24,60 @@ bool CharacterConverter::isComplete() {
   return state->bytes_processed == state->total_bytes;
 }
 
-int CharacterConverter::push(char8_t utf8_byte) {}
+int CharacterConverter::push(char8_t utf8_byte) {
+  // Checking the first byte if first push
+  if (state->bytes_processed == 0 && state->total_bytes == 0) {
+    state->partial = static_cast<char32_t>(0);
+    uint8_t numOnes = static_cast<uint8_t>(cpp::countl_one(utf8_byte));
+    // 1 byte total
+    if (numOnes == 0) {
+      state->total_bytes = 1;
+    }
+    // 2 through 4 bytes total
+    else if (numOnes >= 2 && numOnes <= 4) {
+      state->total_bytes = numOnes;
+      utf8_byte &= (0x7F >> numOnes);
+    }
+    // Invalid first byte
+    else {
+      return -1;
+    }
+    state->partial = static_cast<char32_t>(utf8_byte);
+    state->bytes_processed++;
+    return 0;
+  }
+  // Any subsequent push
+  // Adding 6 more bits so need to left shift
+  const int BITS_PER_UTF8 = 6;
+  if (cpp::countl_one(utf8_byte) == 1 && !isComplete()) {
+    char32_t byte = utf8_byte & 0x3F;
----------------
michaelrj-google wrote:

To avoid another magic number, I'd recommend replacing `0x3F` with `mask_trailing_ones<char32_t, BITS_PER_UTF8>()`.
That makes the relationship between the mask and `BITS_PER_UTF8` explicit.
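
As a rough illustration of the suggestion, here is a minimal, self-contained sketch. `mask_trailing_ones_sketch` and `continuation_payload` are hypothetical names invented for this example; the sketch is a stand-in for the `mask_trailing_ones` helper named above, not its actual implementation, and it uses `std::uint8_t` in place of `char8_t` so it compiles as C++17.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the helper named in the review comment:
// returns a value of type T whose lowest `count` bits are set.
template <typename T, std::size_t count>
constexpr T mask_trailing_ones_sketch() {
  constexpr std::size_t t_bits = sizeof(T) * 8;
  static_assert(count <= t_bits, "count exceeds the width of T");
  if constexpr (count == 0) {
    return T(0);
  } else {
    // Start from all ones and shift out the bits we do not want.
    return static_cast<T>(static_cast<T>(~T(0)) >> (t_bits - count));
  }
}

// Each UTF-8 continuation byte (10xxxxxx) carries 6 payload bits.
constexpr std::size_t BITS_PER_UTF8 = 6;

// Hypothetical helper: extract the payload bits of a continuation byte.
// The mask is derived from BITS_PER_UTF8 instead of being spelled 0x3F.
constexpr char32_t continuation_payload(std::uint8_t byte) {
  return static_cast<char32_t>(
      byte & mask_trailing_ones_sketch<char32_t, BITS_PER_UTF8>());
}

// The derived mask is the same value as the literal it replaces.
static_assert(mask_trailing_ones_sketch<char32_t, BITS_PER_UTF8>() == 0x3F);
```

With the mask written this way, changing `BITS_PER_UTF8` would change the shift amount and the mask together, which is the coupling the literal `0x3F` hides.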

https://github.com/llvm/llvm-project/pull/143973

