[libc-commits] [libc] [libc] Implemented CharacterConverter push/pop for utf32->utf8 conversions (PR #143971)

Fri Jun 13 16:27:05 PDT 2025

================
@@ -22,13 +24,61 @@ bool CharacterConverter::isComplete() {
   return state->bytes_processed == state->total_bytes;
 }
 
-int CharacterConverter::push(char8_t utf8_byte) {}
+int CharacterConverter::push(char32_t utf32) {
+  state->partial = utf32;
+  state->bytes_processed = 0;
+  state->total_bytes = 0;
 
-int CharacterConverter::push(char32_t utf32) {}
+  // determine number of utf-8 bytes needed to represent this utf32 value
+  constexpr char32_t ranges[] = {0x7f, 0x7ff, 0xffff, 0x10ffff};
+  constexpr int num_ranges = 4;
+  for (uint8_t i = 0; i < num_ranges; i++) {
+    if (state->partial <= ranges[i]) {
+      state->total_bytes = i + 1;
+      break;
+    }
+  }
+  if (state->total_bytes == 0)
+    return -1;
 
-utf_ret<char8_t> CharacterConverter::pop_utf8() {}
+  return 0;
+}
+
+ErrorOr<char8_t> CharacterConverter::pop_utf8() {
+  if (state->bytes_processed >= state->total_bytes)
+    return Error(-1);
+
+  constexpr char8_t FIRST_BYTE_HEADERS[] = {0, 0xC0, 0xE0, 0xF0};
+  constexpr char8_t CONTINUING_BYTE_HEADER = 0x80;
+
+  // the number of bits per utf-8 byte that actually encode character
+  // information not metadata (# of bits excluding the byte headers)
+  constexpr size_t ENCODED_BITS_PER_UTF8 = 6;
+  constexpr int MASK_ENCODED_BITS =
+      mask_trailing_ones<unsigned int, ENCODED_BITS_PER_UTF8>();
 
-utf_ret<char32_t> CharacterConverter::pop_utf32() {}
+  char32_t output;
+
+  // Shift to get the next 6 bits from the utf32 encoding
+  const char32_t shift_amount =
+      (state->total_bytes - state->bytes_processed - 1) * ENCODED_BITS_PER_UTF8;
+  if (state->bytes_processed == 0) {
+    /*
+      Choose the correct set of most significant bits to encode the length
+      of the utf8 sequence. The remaining bits contain the most significant
+      bits of the unicode value of the character.
+    */
+    output = FIRST_BYTE_HEADERS[state->total_bytes - 1] |
----------------
brooksmoses wrote:

So one thing I'm noticing here is that, for non-constexpr bitfield things like this, you're using a lookup table and Sriya is computing them with bitshifts.  I think either option is fine, but it would be good to be consistent.  Especially since the ones used on decode can be just the inverse of the ones used on encode, so the same lookup table (or the same logic) works for both.

In my opinion, the lookup tables are slightly easier to read, but computing with bitshifts _might_ be a little faster to execute (though I wouldn't rely on that without looking at optimized assembly).
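
To make the comparison concrete, here is a minimal sketch of both options for the first-byte header, plus one way the same table could serve the decode side. The function names (`header_from_table`, `header_from_shift`, `length_from_lead_byte`, `check_header_options_agree`) are hypothetical and not part of the patch, and `std::uint8_t` stands in for `char8_t` so the sketch does not require C++20.

```cpp
#include <cassert>
#include <cstdint>

// Option 1: lookup table indexed by (sequence length - 1), as in the patch.
constexpr std::uint8_t FIRST_BYTE_HEADERS[] = {0x00, 0xC0, 0xE0, 0xF0};

constexpr std::uint8_t header_from_table(int total_bytes) {
  return FIRST_BYTE_HEADERS[total_bytes - 1];
}

// Option 2: computed with a shift. The lead byte of an N-byte sequence
// (N >= 2) starts with N one bits; a 1-byte sequence has no header bits.
constexpr std::uint8_t header_from_shift(int total_bytes) {
  return total_bytes == 1
             ? std::uint8_t{0}
             : static_cast<std::uint8_t>(0xFF << (8 - total_bytes));
}

// Decode side (assumption: illustrative only). Reuses the same table to
// classify a lead byte. Returns the sequence length, or -1 for a
// continuation byte or a byte with five or more leading one bits.
// It does not reject overlong or out-of-range lead bytes such as 0xC0.
constexpr int length_from_lead_byte(std::uint8_t b) {
  if (b < 0x80)
    return 1;
  if (b >= 0xF8)
    return -1;
  for (int n = 4; n >= 2; --n)
    if ((b & FIRST_BYTE_HEADERS[n - 1]) == FIRST_BYTE_HEADERS[n - 1])
      return n;
  return -1; // 0x80..0xBF: continuation byte, not a lead byte
}

// Runtime check that both options produce the same header for every length.
inline void check_header_options_agree() {
  for (int n = 1; n <= 4; ++n)
    assert(header_from_table(n) == header_from_shift(n));
}
```

Both options yield identical headers, so the choice comes down to readability and to whatever the optimized assembly shows; sharing one table (or one helper) between encode and decode keeps the two directions from drifting apart.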

https://github.com/llvm/llvm-project/pull/143971