[LLVMdev] X86 FMA4

Sat Jul 28 23:57:49 PDT 2012

Our specialists (Intel) say that “vmovaps” and “vmovsd” have the same throughput and latency, but “vmovsd” reduces the chance of 4K aliasing, so it is preferable.
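
As a rough illustration of why the narrower store helps, here is a minimal sketch in C. It assumes a simplified model of 4K aliasing (a store and a later load touch different memory, but their byte ranges overlap once the addresses are reduced to their low 12 bits). The function name aliases_4k and the addresses in the usage note are hypothetical, and real hardware behaviour may differ from this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model (an assumption, not documented hardware logic):
   a store and a later load "4K-alias" when they touch different memory
   but their byte ranges overlap once addresses are reduced to bits 11:0.
   Ranges that wrap past a 4 KiB boundary are not handled here. */
int aliases_4k(uintptr_t store_addr, size_t store_len,
               uintptr_t load_addr, size_t load_len)
{
    assert(store_len > 0 && load_len > 0);

    /* A genuine overlap is a real dependence, not a false alias. */
    int truly_overlap = store_addr < load_addr + load_len &&
                        load_addr < store_addr + store_len;
    if (truly_overlap)
        return 0;

    uintptr_t s = store_addr & 0xFFF;   /* page offset of the store */
    uintptr_t l = load_addr & 0xFFF;    /* page offset of the load  */

    /* Interval overlap test on the page offsets. */
    return s < l + load_len && l < s + store_len;
}
```

Under this model, aliases_4k(0x1000, 16, 0x2008, 8) reports an alias for a 16-byte store, while aliases_4k(0x1000, 8, 0x2008, 8) does not for an 8-byte store to the same address.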

- Elena
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Cameron McInally
Sent: Thursday, July 26, 2012 17:50
To: Jan Sjodin
Cc: dag at cray.com; llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] X86 FMA4

Hey Jan and Dave,

It's not obvious, but following the GCC intrinsics causes a significant scalar performance issue.

Let's look at the VFMADDSD pattern. We're operating on scalars, with the remaining vector elements of the operands left undefined. This sounds okay, but when one looks closer...

       vmovsd  fp4_+1088(%rip), %xmm3  # fpppp.f:647
       vmovaps %xmm3, 18560(%rsp)      # fpppp.f:647 <= 16-byte spill
       vfmaddsd        %xmm5, fp4_+3288(%rip), %xmm3, %xmm3 # fpppp.f:647

The spill here is 16 bytes, but we're only using the low 8 bytes of xmm3. Changing the intrinsics and patterns to accept scalar operands, we end up with...

       vmovsd  fp4_+1056(%rip), %xmm0  # fpppp.f:666
       vmovsd  %xmm0, 10088(%rsp)      # fpppp.f:666 <= 8-byte spill
       vfmaddsd        %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666

I do not know the actual number of cycles offhand, but I believe that on Interlagos and Sandy Bridge, a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory.
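
To make the type difference concrete, here is a minimal sketch in portable C rather than real intrinsics. The names vec2d, fma_sd_vector and fma_sd_scalar are hypothetical stand-ins, and a * b + c stands in for the fused operation (it is not guaranteed to be a single-rounding FMA). The point is only that the vector-typed form keeps a 16-byte value live where the scalar form keeps 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit vector of two doubles. */
typedef struct { double lane[2]; } vec2d;

/* Vector-typed form, modeled loosely on a GCC-style scalar FMA intrinsic:
   only lane 0 carries a meaningful result, yet the whole 16-byte value is
   what the compiler has to keep (and spill). The upper lane is copied from
   `a` here purely so the sketch is deterministic. */
vec2d fma_sd_vector(vec2d a, vec2d b, vec2d c)
{
    vec2d r = a;
    r.lane[0] = a.lane[0] * b.lane[0] + c.lane[0];
    return r;
}

/* Scalar-typed form: the live value is a single 8-byte double. */
double fma_sd_scalar(double a, double b, double c)
{
    return a * b + c;
}

/* Size of the value that would have to be saved across a spill. */
size_t spill_bytes_vector(void) { return sizeof(vec2d); }
size_t spill_bytes_scalar(void) { return sizeof(double); }
```

On a typical x86-64 target, spill_bytes_vector() is 16 and spill_bytes_scalar() is 8, which matches the vmovaps and vmovsd spills in the listings above.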

-Cameron

On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote:
Because the intrinsics use vector types (same as gcc).

- Jan

----- Original Message -----
> From: "dag at cray.com" <dag at cray.com>
> To: llvmdev at cs.uiuc.edu
> Cc:
> Sent: Wednesday, July 25, 2012 3:26 PM
> Subject: [LLVMdev] X86 FMA4
>
> We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns.
>
> Why is VFMADDSD4 defined with vector types?  Is this simply because the
> gcc intrinsic uses vector types?  It's quite unnatural if you have a
> compiler that generates FMAs as opposed to requiring user intrinsics.