[PATCH] D51542: [X86] Remove wrong ReadAdvance from multiclass sse_fp_unop_s

Mon Sep 3 07:50:59 PDT 2018

spatel accepted this revision.
spatel added a comment.
This revision is now accepted and ready to land.

In https://reviews.llvm.org/D51542#1222176, @andreadb wrote:

> In https://reviews.llvm.org/D51542#1222161, @spatel wrote:
>
> > In https://reviews.llvm.org/D51542#1222120, @andreadb wrote:
> >
> > > In https://reviews.llvm.org/D51542#1221947, @andreadb wrote:
> > >
> > > > In https://reviews.llvm.org/D51542#1221933, @RKSimon wrote:
> > > >
> > > > > There seems to be 2 things here - (1) ReadAfterLd is being used incorrectly to try to make a dependency on the dst reg and (2) we have no mechanism to handle partial dependencies like for the scalar sse instructions.
> > > > >
> > > > > AFAICT this patch fixes the (1) issue but (2) isn't handled at all by the scheduler models yet.
> > > >
> > > >
> > > > Yep.
> > > >
> > > > I will raise a bug for 2. This patch only addresses the issue with ReadAfterLd being incorrectly applied to a register used as the base address of the folded load.
> > >
> > >
> > > Raised bug https://bugs.llvm.org/show_bug.cgi?id=38813.
> >
> >
> > Thanks! I think everyone agrees that the ReadAfterLd is not behaving as intended on the SSE instructions. I am still wondering about the AVX case (and the SSE intrinsic is the same?):
> >  As noted, this is testing with llvm-mca with skylake CPU:
> >
> >   With ReadAfterLd:
> >  
> >   [0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
> >   [0,1]     D=eE-------R   .   leaq	8(%rsp,%rdi,2), %rax
> >   [0,2]     D====eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3  <--- execution delayed by vdppd?
> >  
> >   No ReadAfterLd:
> >  
> >   [0,0]     DeeeeeeeeeER   .    .   vdppd	$1, %xmm0, %xmm1, %xmm2
> >   [0,1]     D=eE-------R   .    .   leaq	8(%rsp,%rdi,2), %rax
> >   [0,2]     D=========eeeeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3   <--- execution delayed until xmm2 is known
> >
> >
> > Do we have confirmation from hardware testing or Intel docs that the 1st timeline is correct? We can punt this question to another patch to not delay this one if needed, but I'd like to understand what the hardware does in this case.
>
>
> VRSQRTSS depends on XMM2, so it delayed by vdppd.
>  XMM2 doesn't have to necessarily be ready immediately when VRSQRTSS starts executing. That is because the uOp for the folded load is executed first, and then the loaded value is forwarded to the pipeline that executes VRSQRTSS. Only at that point, XMM2 has to be available.
>
> I don't have a skylake to microbenchmark that sequence. However, (in my mental model) it kind of makes sense to have a ReadAfterLd there. I would be a "strange" otherwise.

Note: we can observe similar behavior with btver2 as the CPU model with llvm-mca (because the ReadAfterLd is on the default instruction definition):

  [0,0]     DeeeeeeeeeER   .   vdppd	$1, %xmm0, %xmm1, %xmm2
  [0,1]     .DeeE------R   .   leaq	8(%rsp,%rdi,2), %rax
  [0,2]     . D====eeeeeeeER   vrsqrtss	(%rax), %xmm2, %xmm3

And I was thinking the opposite - it seems strange to me that hardware would devote resources to make this case faster, but maybe it's not as difficult as I'm imagining.

We don't need to hold this patch up before we answer that, so LGTM.

https://reviews.llvm.org/D51542