[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Fri Jun 9 22:11:25 PDT 2023

nafi3000 accepted this revision.
nafi3000 added inline comments.

================
Comment at: libc/src/string/memory_utils/utils.h:157-159
+  // We perform the difference as an uint64_t.
+  const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
+  // And reduce the uint64_t into an uint32_t.
----------------
nit: s/uint64_t/int64_t/ and s/uint32_t/int32_t/ in the comments.

================
Comment at: libc/src/string/memory_utils/utils.h:160
+  // And reduce the uint64_t into an uint32_t.
+  // TODO: provide a detailed explanation.
+  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
----------------
For the explanation, please consider whether we can add some version of the following points:
```
For the int64_t to int32_t conversion we want the following properties:
- int32_t[31:31] == 1 iff diff < 0
- int32_t[31:0] == 0 iff diff == 0

We also observe that:
- When diff < 0: diff[63:32] == 0xffffffff and diff[31:0] != 0
- When diff > 0: diff[63:32] == 0 and diff[31:0] != 0
- When diff == 0: diff[63:32] == 0 and diff[31:0] == 0
- https://godbolt.org/z/8W7qWP6e5
- This implies that we only need to look at diff[32:32] to determine the sign bit of the returned int32_t.

So, we do the following:
- int32_t[31:31] = diff[32:32]
- int32_t[30:0] = diff[31:0] == 0 ? 0 : non-0.

We can achieve the above with the expression below. We could also have used (diff >> 1) | (diff & 0x1), but (diff & 0xFFFF) is faster than (diff & 0x1): https://godbolt.org/z/j3b569rW1
```
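
To sanity-check the points above, here is a minimal standalone sketch (not part of the patch). The names `cmp_sketch`, `cmp_sketch_alt` and `sign_of` are hypothetical, and it assumes `a` and `b` are `uint32_t` values, which is what makes `diff[63:32]` either all zeros or all ones. It also assumes an arithmetic right shift on negative values and a wrapping int64_t to int32_t conversion, both guaranteed only since C++20 but the behavior of mainstream compilers before that.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the expression under review.
// Returns <0, 0 or >0 according to how a compares to b.
inline int32_t cmp_sketch(uint32_t a, uint32_t b) {
  // The difference fits in 33 bits, so bit 32 of diff carries the sign.
  const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
  // (diff >> 1) moves bit 32 into bit 31 of the result (the sign bit).
  // (diff & 0xFFFF) keeps bit 0, which the shift would otherwise drop,
  // so the result is nonzero whenever diff is nonzero.
  return static_cast<int32_t>((diff >> 1) | (diff & 0xFFFF));
}

// Alternative mentioned in the comment: only bit 0 is OR-ed back in.
inline int32_t cmp_sketch_alt(uint32_t a, uint32_t b) {
  const int64_t diff = static_cast<int64_t>(a) - static_cast<int64_t>(b);
  return static_cast<int32_t>((diff >> 1) | (diff & 0x1));
}

// Collapses a comparison result to -1, 0 or 1 for easy checking.
inline int sign_of(int32_t v) { return (v > 0) - (v < 0); }
```

Both variants should agree in sign for every pair of inputs, including the edge cases `a - b == 1` (where the shifted value alone would be 0) and `a - b == 0xFFFFFFFF` (where bit 31 of `diff` is set but the result must stay positive).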

We can also add all of this in a separate diff.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717