[PATCH] D129775: [x86] use zero-extending load of a byte outside of loops too

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Jul 16 10:02:38 PDT 2022


pcordes accepted this revision.
pcordes added inline comments.
This revision is now accepted and ready to land.


================
Comment at: llvm/test/CodeGen/X86/ushl_sat_vec.ll:72
 ; X86-NEXT:    movl {{[0-9]+}}(%esp), %edi
 ; X86-NEXT:    movb {{[0-9]+}}(%esp), %cl
 ; X86-NEXT:    movl {{[0-9]+}}(%esp), %ebx
----------------
craig.topper wrote:
> RKSimon wrote:
> > Why did only 1 of these movb get extended?
> %ch is live from line 69.
It's still possible to avoid the false dependency by doing the `movzbl (mem), %ecx` load first, then `movb (mem), %ch`.
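
A minimal sketch of that ordering (the stack offsets here are placeholders, not the real operands):

    movzbl 4(%esp), %ecx    # zero-extending load writes all of ECX; no dependency on its old value
    movb   8(%esp), %ch     # high-8 load merges on top of the ECX we just wrote

Doing the `movzbl` first means the write to %ch only depends on a value this code just produced, not on whatever stale value was in ECX.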

Reading the full CX/ECX/RCX will still need a merging uop on Intel SnB-family CPUs (which rename high-8 registers separately from the full register), and unfortunately that merging uop has to issue in a cycle by itself.  (So in terms of front-end cost, that extra cost is effectively 4 or 5 uops' worth of issue bandwidth, not just 1 more, since the merge occupies a whole issue group on a 4- or 5-wide pipeline.  Back-end contention for execution units is rarely a limiting factor for uops that can run on any port.)  But that merging cost is not paid until later, on the first read of the full register.
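
For example (a sketch, assuming SnB-family partial-register renaming; offsets are placeholders):

    movb 8(%esp), %ch    # CH is renamed separately from ECX; no merge needed yet
    addl %ecx, %eax      # first read of the full ECX: a merging uop is inserted here, issuing in a cycle by itself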

And on AMD CPUs (and Silvermont-family, like the Alder Lake E-cores), there's no later merging cost: writing CH merges on the spot.  So it's a nice win vs. `movzbl` into a temporary followed by `shl $8, %tmp` / `or %tmp, %ecx`.
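
Spelled out for comparison (temporary register and offsets are placeholders):

    movzbl 4(%esp), %ecx    # low byte
    movzbl 8(%esp), %edx    # high byte into a temporary
    shll   $8, %edx
    orl    %edx, %ecx       # two extra ALU uops plus a temp register, vs. one movb into %ch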


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129775/new/

https://reviews.llvm.org/D129775


