[libc-commits] [PATCH] D100646: [libc] Add a set of elementary operations

Wed Apr 28 07:44:04 PDT 2021

gchatelet marked 3 inline comments as done.
gchatelet added a subscriber: sivachandra.
gchatelet added inline comments.

================
Comment at: libc/src/string/memory_utils/elements.h:154-166
+inline int Scalar<uint32_t>::ScalarThreeWayCompare(uint32_t a, uint32_t b) {
+  const int64_t la = Endian::ToBigEndian(a);
+  const int64_t lb = Endian::ToBigEndian(b);
+  const int64_t diff = la - lb;
+  return diff ? (diff < 0 ? -1 : 1) : 0;
+}
+template <>
----------------
avieira wrote:
> gchatelet wrote:
> > @avieira the generated code is not that bad but maybe you can come up with a better idea?
> > 
> > The challenge here is to keep the semantic of the [3-way comparison](https://en.wikipedia.org/wiki/Three-way_comparison) that fits in an `int` while working on types which diff is larger than `int`.
> > 
> > Generated code:
> > intel : https://godbolt.org/z/Khxv6b3r9
> > arm : https://godbolt.org/z/YTah338hY
> @gchatelet Avoiding the subtraction and comparing the 'la' and 'lb' variables directly lead to better codegen on AArch64.
Indeed it's better, thank you.

================
Comment at: libc/src/string/memory_utils/elements.h:189-205
+  static int ThreeWayCompare(const char *a, const char *b) {
+    const auto mask = Base::NotEqualMask(Base::Load(a), Base::Load(b));
+    if (!mask)
+      return 0;
+    return CharDiff(a, b, mask);
+  }
+
----------------
avieira wrote:
> gchatelet wrote:
> > @avieira for the vector version of the three way compare we compute the byte-wise not equal mask, convert this mask to GPR and look at the number of trailing zeros (X86 is LittleEndian), this number is the index of the mismatching byte.
> > 
> > After trying to work on the vector directly to extract this byte I figured out it's easier to just reload it, it should be fast and probably faster than a bunch of vector operations.
> > 
> > i suspect the same approach will work for ARM as well, WDYT?
> @gchatelet I did not look at using NEON. The problem with the mask approach is that NEON does not have movemask-like instructions, so finding the first mismatched byte on a vector would require more work than it is worth it I think. SVE2 might help us here in the future though.
We can add an arm specialization as I did for x86 if needed, let's keep it as-is for now.

================
Comment at: libc/test/src/string/memory_utils/CMakeLists.txt:14
+  COMPILE_OPTIONS
+    -march=native
 )
----------------
avieira wrote:
> AArch64 doesn't support -march=native, we use -mcpu=native instead, so it would be good to specify the correct option per target. 
Thx I wasn't aware of this. I found this article, it's helpful and confusing at the same time :-D
https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu

@sivachandra should we define a new CMake macro or do you think this should be part of `add_libc_unittest`?
The aim is to use whatever is available on the host machine when performing the tests.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100646/new/

https://reviews.llvm.org/D100646