[PATCH] D51542: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s

Fri Aug 31 10:43:30 PDT 2018

spatel added a comment.

I think this requires an understanding of the intent of ReadAfterLd:

  // Instructions with folded loads need to read the memory operand immediately,
  // but other register operands don't have to be read until the load is ready.
  // These operands are marked with ReadAfterLd.

...that https://reviews.llvm.org/D51534 did not. That's because a broadcast has only one source operand (the load itself), so ReadAfterLd doesn't even make sense on that instruction?

In this case, we have 2 source operands:

1. The loaded value that we're doing the math on.
2. The unchanging vector lanes of the second source (destination) register.

The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is happening, or what should be happening, in a case like this on Skylake:

Trunk:

  [0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
  [0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
  [0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is computed

Apply this patch (remove ReadAfterLd):

  [0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
  [0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
  [0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:

  [0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
  [0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
  [0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?

No ReadAfterLd:

  [0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
  [0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
  [0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known
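
As a sanity check on the two AVX timelines above, here is a minimal Python sketch of the rule from the ReadAfterLd comment: an instruction can issue once every operand is ready, where an operand with a read advance of N may arrive N cycles late. This is a toy model, not llvm-mca's actual algorithm. The helper names are hypothetical, the 'E' column is taken as the cycle a result becomes available, and the read advance of 5 cycles is an assumption chosen because it matches the 5-cycle gap between the two vrsqrtss rows.

```python
# Toy model of ReadAdvance; all names here are hypothetical, not LLVM APIs.

ASSUMED_READ_ADVANCE = 5  # assumption: inferred from the timelines, not from the .td files

# Timeline rows copied from the AVX examples (column index = cycle).
VDPPD = "DeeeeeeeeeER"
LEAQ = "D=eE-------R"
VRSQRTSS_WITH_READAFTERLD = "D====eeeeeeeeeER"
VRSQRTSS_NO_READAFTERLD = "D=========eeeeeeeeeER"


def first_execute_cycle(row: str) -> int:
    """Cycle at which the instruction starts executing (first 'e')."""
    return row.index("e")


def result_ready_cycle(row: str) -> int:
    """Cycle at which execution finishes ('E'); treated as result-ready."""
    return row.index("E")


def earliest_issue(operands, dispatch_cycle=0):
    """operands: iterable of (ready_cycle, read_advance) pairs.

    An operand with read advance N only needs to be ready N cycles
    after issue, so it constrains issue to ready_cycle - N.
    """
    earliest = dispatch_cycle + 1  # cannot issue in the dispatch cycle
    for ready, advance in operands:
        earliest = max(earliest, ready - advance)
    return earliest


rax_ready = result_ready_cycle(LEAQ)    # address register, read immediately
xmm2_ready = result_ready_cycle(VDPPD)  # register source, may be read late

with_read_after_ld = earliest_issue(
    [(rax_ready, 0), (xmm2_ready, ASSUMED_READ_ADVANCE)]
)
no_read_after_ld = earliest_issue([(rax_ready, 0), (xmm2_ready, 0)])

print(with_read_after_ld, first_execute_cycle(VRSQRTSS_WITH_READAFTERLD))
print(no_read_after_ld, first_execute_cycle(VRSQRTSS_NO_READAFTERLD))
```

Under these assumptions the model and the timeline agree for both vrsqrtss rows. It says nothing about the SSE rows, where the open question is whether xmm2 is tracked as a source operand at all.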

https://reviews.llvm.org/D51542