[PATCH] D51542: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s

Fri Aug 31 15:45:34 PDT 2018

andreadb added a comment.

Hi Sanjay,

Thanks for the feedback.

In https://reviews.llvm.org/D51542#1221074, @spatel wrote:

> I think this requires an understanding of the intent of ReadAfterLd:
>
>   // Instructions with folded loads need to read the memory operand immediately,
>   // but other register operands don't have to be read until the load is ready.
>   // These operands are marked with ReadAfterLd.
>
>
> ...that https://reviews.llvm.org/D51534 did not. That's because a broadcast only has one source operand, so ReadAfterLd doesn't even make sense on that instruction?

Not only it didn't make any sense. It was even harmful because it was decreasing the use latency of the register used as the base address for the folded load by 'ReadAfterLd' cycles..

> In this case, we have 2 source operands:
> 
> 1. The loaded value that we're doing the math on.
> 2. The unchanging vector lanes of the second source (destination) register.

I think you are getting confused. These are not instructions with 2 input operands. 
These are just SSE1/SSE2 unary operations (one def, and one use; see below the tablegen definition).

  def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins x86memop:$src1)

> The patch has the intended effect of making the math op depend on the load operand, but it's not clear to me what is or should be happening in a case like this on skylake:
> 
> Trunk:
> 
>   [0,0]     DeeeeeeeeeER.   dppd	$1, %xmm1, %xmm2 <--- long latency, but pipelined
>   [0,1]     D=eE-------R.   leaq	8(%rsp,%rdi,2), %rax
>   [0,2]     D=eeeeeeeeeER   rsqrtss	(%rax), %xmm2 <--- wrong: this can't start executing before %rax is loaded
>   
> 
> 
> Apply this patch (remove ReadAfterLd:)
> 
>   [0,0]     DeeeeeeeeeER .   dppd	$1, %xmm1, %xmm2
>   [0,1]     D=eE-------R .   leaq	8(%rsp,%rdi,2), %rax
>   [0,2]     D==eeeeeeeeeER   rsqrtss	(%rax), %xmm2  <--- is this right? the calc can begin before xmm2 is known?

Here rsqrtss cannot start if %rax is not available.
The value of %xmm2 is not needed because a) it is not an input register, and b) false dependencies on %xmm2 are eliminated by the register renamer.

> But with AVX the 2nd source is explicit, and ReadAfterLd has a different effect:
> 
>   [0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
>   [0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
>   [0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?
>   

Here the use latency of %xmm2 is decreased by a number of cycles specified by the read-advance.
>From the look of that timeline, you are probably testing for Skylake. Skylake defines ReadAfterLd as follows:

  def : ReadAdvance<ReadAfterLd, 5>;

So the vrsqrtss can assume that %xmm2 is available 5 cycles in advance (i.e the use latency is decreased by 5 cycles).

> No ReadAfterLd:
> 
>   [0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
>   [0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
>   [0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known
>   

My patch doesn't modify/remove that ReadAfterLd. It would be wrong.
As I wrote, this patch only affects SSE1/2 definitions.

I hope this helps
-Andrea

https://reviews.llvm.org/D51542