[LLVMdev] X86 FMA4

Thu Jul 26 07:49:35 PDT 2012

Hey Jan and Dave,

It's not obvious, but following the GCC intrinsics causes a significant
scalar performance issue.

Let's look at the VFMADDSD pattern. We're operating on scalars, with the
remaining vector elements of the operands undefined. This sounds
okay, but when one looks closer...

       vmovsd  fp4_+1088(%rip), %xmm3  # fpppp.f:647
       vmovaps %xmm3, 18560(%rsp)      # fpppp.f:647 <= 16-byte spill
       vfmaddsd        %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647

The spill here is 16 bytes, but we're only using the low 8 bytes of
xmm3. Changing the intrinsics and patterns to accept scalar operands, we
end up with...

       vmovsd  fp4_+1056(%rip), %xmm0  # fpppp.f:666
       vmovsd  %xmm0, 10088(%rsp)      # fpppp.f:666 <= 8-byte spill
       vfmaddsd        %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666

I do not know the actual number of cycles offhand, but I believe that on
Interlagos and Sandy Bridge a vmovaps takes roughly 3x as many micro-ops as
a vmovsd if it involves memory.
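
As a rough illustration of the spill-size difference above, here is a small
C sketch. It is not taken from the LLVM sources: the v2f64 struct and the
function names are hypothetical stand-ins for a 128-bit vector operand and
for the two pattern styles, and a plain multiply-add stands in for the fused
operation. It only models the operand types, not the instruction itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit XMM value: two double lanes. */
typedef struct { double lane[2]; } v2f64;

/* A value of this type occupies both lanes, so saving it means saving both. */
static_assert(sizeof(v2f64) == 2 * sizeof(double), "two packed lanes");

/* Vector-typed form, modeled on the intrinsic style discussed above:
   only lane 0 is computed, but every operand is a full vector. */
v2f64 fmadd_sd_vector(v2f64 a, v2f64 b, v2f64 c) {
    v2f64 r;
    r.lane[0] = a.lane[0] * b.lane[0] + c.lane[0]; /* unfused stand-in */
    r.lane[1] = 0.0; /* upper lane is not meaningful; zeroed as a modeling choice */
    return r;
}

/* Scalar-typed form: the operands are just doubles. */
double fmadd_sd_scalar(double a, double b, double c) {
    return a * b + c; /* unfused stand-in */
}

/* Bytes a spill slot must hold for one operand of each type. */
size_t spill_bytes_vector(void) { return sizeof(v2f64); }
size_t spill_bytes_scalar(void) { return sizeof(double); }
```

Both forms compute the same low-lane result, but an operand of the
vector-typed form needs a spill slot twice as large as the scalar one, which
is the difference between the vmovaps and vmovsd spills shown earlier.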

-Cameron

On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:

> Because the intrinsics use vector types (same as gcc).
>
>
> - Jan
>
>
>
> ----- Original Message -----
> > From: "dag at cray.com" <dag at cray.com>
> > To: llvmdev at cs.uiuc.edu
> > Cc:
> > Sent: Wednesday, July 25, 2012 3:26 PM
> > Subject: [LLVMdev] X86 FMA4
> >
> > We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
> >
> > Why is VFMADDSD4 defined with vector types?  Is this simply because the
> > gcc intrinsic uses vector types?  It's quite unnatural if you have a
> > compiler that generates FMAs as opposed to requiring user intrinsics.
>