[PATCH] D129775: [x86] use zero-extending load of a byte outside of loops too
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Jul 16 10:02:38 PDT 2022
pcordes accepted this revision.
pcordes added inline comments.
This revision is now accepted and ready to land.
================
Comment at: llvm/test/CodeGen/X86/ushl_sat_vec.ll:72
; X86-NEXT: movl {{[0-9]+}}(%esp), %edi
; X86-NEXT: movb {{[0-9]+}}(%esp), %cl
; X86-NEXT: movl {{[0-9]+}}(%esp), %ebx
----------------
craig.topper wrote:
> RKSimon wrote:
> > Why did only 1 of these movb get extended?
> %ch is live from line 69.
It's still possible to avoid the false dependency by doing `movzbl (mem), %ecx` first, then `movb (mem), %ch`.
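Something like this, for example (the operands here are placeholders, not the actual stack offsets from this test):

    movzbl 4(%esp), %ecx    # zero-extending load writes all of ECX,
                            # breaking the dependency on its old value
    movb   8(%esp), %ch     # second byte loads into CH; no false
                            # dependency, ECX was just fully written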
Reading the full CX/ECX/RCX will still need a merging uop on Intel SnB-family CPUs (which rename high-8 registers separately from the full register), and unfortunately that merging uop has to issue in a cycle by itself. So in terms of front-end cost, the extra cost is 4 or 5 uops (a whole issue group on a 4- or 5-wide front-end), not just 1 more; back-end contention for execution units is rarely a limiting factor for uops that can run on any port. But that merging cost is not paid until later, on the first read of the full register.
And on AMD CPUs (and Silvermont-family cores, like Alder Lake's E-cores), there's no later merging cost: writing CH merges on the spot. So it's a nice win vs. `movzbl` into a temporary followed by `shl $8, %tmp` / `or %tmp, %ecx`.
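For comparison, that temporary-register sequence would look something like this (register choice and offsets are illustrative):

    movzbl 4(%esp), %ecx    # low byte
    movzbl 8(%esp), %edx    # high byte into a scratch register
    shll   $8, %edx         # shift it into byte-1 position
    orl    %edx, %ecx       # merge into ECX: three instructions and a
                            # scratch reg, but no partial-register merge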
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D129775/new/
https://reviews.llvm.org/D129775