[LLVMdev] X86 FMA4

Thu Jul 26 11:46:17 PDT 2012

Ah, bad example. This is a general problem for all (or at least most) SSE
and AVX SS/SD patterns though, which is why I mentioned Sandy Bridge. You can
swap out VFMADDSD in my example for VADDSD or whatever you like.
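
To make the type issue concrete, here is a minimal C sketch of the two ways a
scalar multiply-add can be typed. It assumes the GCC/Clang vector_size
extension; the names v2f64, madd_sd_vector, madd_sd_scalar and the
spill_bytes_* helpers are made up for illustration and are not LLVM or GCC
intrinsics. It only models the operand widths, not the actual instruction
selection or any particular instruction's upper-lane behavior.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two-lane double vector type that the
   vector-typed intrinsics take as operands (16 bytes wide). */
typedef double v2f64 __attribute__((vector_size(16)));

/* Vector-typed scalar op: only lane 0 carries a meaningful result, but the
   value the register allocator has to keep alive is the full 16 bytes. */
static v2f64 madd_sd_vector(v2f64 a, v2f64 b, v2f64 c) {
    v2f64 r = a;                 /* upper lane is a don't-care in this sketch */
    r[0] = a[0] * b[0] + c[0];   /* multiply-add on the low lane only */
    return r;
}

/* Scalar-typed op: same arithmetic on the low element, 8-byte value. */
static double madd_sd_scalar(double a, double b, double c) {
    return a * b + c;
}

/* Size of the stack slot a spill of each value type would need. */
static size_t spill_bytes_vector(void) { return sizeof(v2f64); }
static size_t spill_bytes_scalar(void) { return sizeof(double); }
```

With vector-typed operands a spill has to save the whole 16-byte value; with
scalar-typed operands an 8-byte slot is enough, which matches the vmovaps
versus vmovsd spills in the quoted example below.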

I have the lion's share of such a change implemented already, and performance
is greatly affected. If the community is interested in this change, I would
be happy to prepare a patch.

-Cameron

On Thu, Jul 26, 2012 at 2:27 PM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:

> You can't execute FMA4 instructions on Intel processors, so it doesn't
> really matter what the impact of the move instructions would be, since it
> would end up with an illegal instruction regardless. :) It does perhaps
> bring up an issue of tuning for different architectures, but that is
> something nobody is really looking into at the moment afaik.
>
>
> - Jan
>
> >________________________________
> > From: Cameron McInally <cameron.mcinally at nyu.edu>
> >To: Jan Sjodin <jan_sjodin at yahoo.com>
> >Cc: "dag at cray.com" <dag at cray.com>; "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
> >Sent: Thursday, July 26, 2012 10:49 AM
> >Subject: Re: [LLVMdev] X86 FMA4
> >
> >
> >Hey Jan and Dave,
> >
> >
> >It's not obvious, but there is a significant scalar performance issue
> >following the GCC intrinsics.
> >
> >
> >Let's look at the VFMADDSD pattern. We're operating on scalars, with
> >undefined values as the remaining vector elements of the operands. This
> >sounds okay, but when one looks closer...
> >
> >       vmovsd  fp4_+1088(%rip), %xmm3  # fpppp.f:647
> >       vmovaps %xmm3, 18560(%rsp)      # fpppp.f:647 <= 16-byte spill
> >       vfmaddsd        %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647
> >
> >
> >The spill here is 16 bytes, but we're only using the low 8 bytes of
> >xmm3. Changing the intrinsics and patterns to accept scalar operands, we
> >end up with...
> >
> >       vmovsd  fp4_+1056(%rip), %xmm0  # fpppp.f:666
> >       vmovsd  %xmm0, 10088(%rsp)      # fpppp.f:666 <= 8-byte spill
> >       vfmaddsd        %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666
> >
> >
> >I do not know the actual number of cycles offhand, but I believe that on
> >Interlagos and Sandy Bridge, a vmovaps takes roughly 3x as many micro-ops
> >as a vmovsd if it involves memory.
> >
> >
> >-Cameron
> >
> >
> >On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
> >
> >>Because the intrinsics use vector types (same as gcc).
> >>
> >>
> >>- Jan
> >>
> >>
> >>
> >>----- Original Message -----
> >>> From: "dag at cray.com" <dag at cray.com>
> >>> To: llvmdev at cs.uiuc.edu
> >>> Cc:
> >>> Sent: Wednesday, July 25, 2012 3:26 PM
> >>> Subject: [LLVMdev] X86 FMA4
> >>>
> >>> We're migrating to LLVM 3.1 and trying to use the upstream FMA
> >>> patterns.
> >>>
> >>> Why is VFMADDSD4 defined with vector types?  Is this simply because the
> >>> gcc intrinsic uses vector types?  It's quite unnatural if you have a
> >>> compiler that generates FMAs as opposed to requiring user intrinsics.
> >>
> >
> >
>