[libc-commits] [PATCH] D148717: [libc] Improve memcmp latency and codegen

Fri Jun 30 01:54:49 PDT 2023

nafi3000 accepted this revision.
nafi3000 added inline comments.

================
Comment at: libc/src/string/memory_utils/utils.h:198-201
+  //   cmp     rdi, rsi         <- serializing
+  //   mov     ecx, -5          <- can be done in parallel
+  //   mov     eax, 5           <- can be done in parallel
+  //   cmovb   eax, ecx         <- serializing
----------------
gchatelet wrote:
> lntue wrote:
> > I wonder what the tradeoffs are between this and what is generated for 1 and -1. If this is better, then the compiler should just use this for 1 and -1 too, right?
> 
> x86 does not have a conditional negate instruction, and the codegen for returning 1 and -1 has higher latency.
> ```
>         xor     eax, eax
>         cmp     rdi, rsi <- serializing
>         sbb     eax, eax <- dep on previous instruction
>         or      eax, 1   <- dep on previous instruction
> ```
> 
> I think the tradeoff is about register pressure. In the `-1` / `1` case we only need `eax`, at the expense of a longer dependency chain.
> In the `-5` / `5` case we need `ecx` on top of `eax`, but the dependency chain is shorter, so latency is reduced. Since latency matters for `memcmp`, it makes more sense to use this construct.
> 
> Now TBH I haven't measured whether the overall generated code is better, but I'll run a few tests before landing.
> 
> https://godbolt.org/z/Gqahv7r7e
The compiler could also have used `edi` or `esi` instead of `ecx`. Would that slightly lower register pressure? E.g. why is it not doing something like:
```
cmp rdi, rsi
mov edi, -5
mov eax, 5
cmovb eax, edi
```
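
For reference, a minimal source-level sketch of the two variants discussed in this thread. The function names are hypothetical and are not the identifiers used in `utils.h`. The instruction sequences named in the comments are the ones quoted above and may differ across compilers and flags. Since `memcmp` only specifies the sign of a nonzero result, both variants are valid return values.
```cpp
#include <cstdint>

// Hypothetical name, for illustration only.
// Returns -5 if a < b, otherwise 5. Per the discussion above, this shape
// can be lowered to two independent `mov`s plus one `cmovb` after the `cmp`.
constexpr int32_t cmp_neq_const5(uint64_t a, uint64_t b) {
  return a < b ? -5 : 5;
}

// Hypothetical name, for illustration only.
// Returns -1 if a < b, otherwise 1. Per the discussion above, this shape
// can be lowered to `cmp` / `sbb` / `or`, a longer dependency chain.
constexpr int32_t cmp_neq_const1(uint64_t a, uint64_t b) {
  return a < b ? -1 : 1;
}

// Callers of memcmp may only rely on the sign, which both variants preserve.
static_assert(cmp_neq_const5(1, 2) < 0 && cmp_neq_const1(1, 2) < 0, "");
static_assert(cmp_neq_const5(2, 1) > 0 && cmp_neq_const1(2, 1) > 0, "");
```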

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148717/new/

https://reviews.llvm.org/D148717