[llvm-commits] [llvm] r161152 - in /llvm/trunk: include/llvm/Target/TargetInstrInfo.h lib/CodeGen/PeepholeOptimizer.cpp lib/Target/X86/X86InstrInfo.cpp lib/Target/X86/X86InstrInfo.h test/CodeGen/X86/2012-05-19-avx2-store.ll test/CodeGen/X86/break-sse-dep.ll test/CodeGen/X86/fold-load.ll test/CodeGen/X86/fold-pcmpeqd-1.ll test/CodeGen/X86/sse-minmax.ll test/CodeGen/X86/vec_compare.ll

Thu Aug 2 12:49:10 PDT 2012

On Thu, 2012-08-02 at 11:02 -0700, Jakob Stoklund Olesen wrote:
> On Aug 2, 2012, at 12:31 AM, Michael Liao <michael.liao at intel.com> wrote:
> 
> > Some cases conflict with the previous effort by Bruno Cardoso Lopes
> > to remove partial register update stalls.
> > 
> > For example, sqrtsd with a memory operand is such an instruction: in
> > SSE it updates only part of the register. It should be selected if the
> > code is optimized for size. Otherwise, the sequence of movsd + sqrtsd
> > is preferred over sqrtsd with a memory operand.
> 
> Actually, our current approach to this is not very good.
> 
> We prevent loads from being folded into sqrtsd:
> 
>   movsd (…), %xmm0
>   sqrtsd %xmm0, %xmm0
> 
> But we don't actually make any effort to make the sqrtsd input and output operands the same, so we might as well produce:
> 
>   movsd (…), %xmm0
>   sqrtsd %xmm0, %xmm1
> 
> Which is completely pointless because there is still a partial register dependency on %xmm1.
> 
> A better approach would be to fold the load aggressively:
> 
>   sqrtsd (…), %xmm1
> 
> And then teach X86InstrInfo::breakPartialRegDependency() to unfold the load instead of inserting an xorps dependency breaking instruction:
> 
>   xorps %xmm1, %xmm1
>   sqrtsd (…), %xmm1

In fact, this is what I wanted to suggest for breaking the partial
register stall. The xorps idiom is better than movsd + sqrtsd: it saves
1 byte of instruction encoding and has much more efficient support in
OOO processors.

If no one is working on that, I will start developing a machine pass to
break this kind of partial register stall.

Yours
- Michael

> 
> Would become:
> 
>   movsd (…), %xmm1
>   sqrtsd %xmm1, %xmm1
> 
> Since this happens after register allocation, we can make sure to pick the same register for the sqrtsd input and output. The load will also only be unfolded where there is a nearby def of %xmm1.
> 
> /jakob
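
For illustration only, the unfolding rule Jakob describes (unfold the load only where there is a nearby def of the destination register, and reuse that register for input and output) can be sketched on a toy instruction list. This is not LLVM code and does not use X86InstrInfo or MachineInstr. The Inst struct, the helper names, the "(mem)" placeholder operand, and the fixed look-back window are all assumptions made for the sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a post-register-allocation instruction (hypothetical,
// not LLVM's MachineInstr).
struct Inst {
    std::string opcode; // e.g. "sqrtsd"
    std::string src;    // a register name, or a memory operand like "(mem)"
    std::string dst;    // destination register
};

// Hypothetical helper: treat any operand starting with '(' as memory.
bool isMemOperand(const std::string &op) {
    return !op.empty() && op.front() == '(';
}

// Hypothetical helper: sqrtsd with a folded load writes only the low
// part of dst, so it depends on the previous value of dst.
bool isFoldedPartialUpdate(const Inst &I) {
    return I.opcode == "sqrtsd" && isMemOperand(I.src);
}

// For each folded partial-update instruction, look back at most `window`
// instructions for a def of the same destination register. If one is
// found, unfold the load into "movsd mem, dst; op dst, dst". Otherwise
// the folded form is kept unchanged.
std::vector<Inst> unfoldNearDefs(const std::vector<Inst> &code,
                                 std::size_t window) {
    std::vector<Inst> out;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Inst &I = code[i];
        bool nearbyDef = false;
        if (isFoldedPartialUpdate(I)) {
            std::size_t begin = i > window ? i - window : 0;
            for (std::size_t j = begin; j < i; ++j)
                if (code[j].dst == I.dst)
                    nearbyDef = true;
        }
        if (nearbyDef) {
            // Same register for the load result, the input and the output.
            out.push_back({"movsd", I.src, I.dst});
            out.push_back({I.opcode, I.dst, I.dst});
        } else {
            out.push_back(I);
        }
    }
    return out;
}
```

In this sketch, a sqrtsd from memory into %xmm1 that directly follows another write to %xmm1 becomes movsd + sqrtsd on %xmm1, while the same instruction with no def of %xmm1 inside the window stays folded. The window size stands in for whatever clearance heuristic a real pass would use.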