Sat Jun 14 16:56:57 PDT 2014


            Bug ID: 20043
           Summary: Only one version of FMA3 instruction is being
           Product: clang
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: -New Bugs
          Assignee: unassignedclangbugs at nondot.org
          Reporter: chris.a.ferguson at gmail.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

Given the following code:

#include <immintrin.h>

__m128 fmatest(__m128 x)
{
    return _mm_fmadd_ps(x, _mm_set1_ps(2.0f), _mm_set1_ps(-1.0f));
}

I get the following output from Clang 3.4 (using -O3 -march=core-avx2):

    .long    3212836864              # float -1
    .long    1073741824              # float 2
fmatest(float __vector(4)):                           # @fmatest(float
    vbroadcastss    xmm2, dword ptr [rip + .LCPI0_0]
    vbroadcastss    xmm1, dword ptr [rip + .LCPI0_1]
    vfmadd213ps    xmm1, xmm0, xmm2
    vmovaps    xmm0, xmm1

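The final vmovaps only copies the result from xmm1 into the return register
xmm0. It would be unnecessary if an alternate fmadd instruction were used, one
that leaves the result in xmm0 directly. For instance, this is what GCC 4.9
produces: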

fmatest(float __vector):
    vmovaps    xmm1, XMMWORD PTR .LC1[rip]
    vfmadd132ps    xmm0, xmm1, XMMWORD PTR .LC0[rip]
    .long    1073741824
    .long    1073741824
    .long    1073741824
    .long    1073741824
    .long    3212836864
    .long    3212836864
    .long    3212836864
    .long    3212836864
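
To make the operand roles concrete, here is a rough scalar model of the three
FMA3 forms and of the two sequences above. It is only a sketch: the helper
names (fmadd132, fmadd213, fmadd231, clang_sequence, gcc_sequence) are made up
for illustration, one float stands in for a whole xmm register, and fused
single rounding is not modeled. It also assumes that .LCPI0_0 and .LC1 hold
-1.0 and that .LCPI0_1 and .LC0 hold 2.0, which matches the order of the
constants in the listings but cannot be confirmed because the labels are not
shown.

```c
/* Scalar model of the FMA3 operand orders (Intel syntax: op1 is always the
   destination). The digits say which operands are multiplied (first two)
   and which one is added (last). Rounding is NOT modeled. */
static float fmadd132(float op1, float op2, float op3) { return op1 * op3 + op2; }
static float fmadd213(float op1, float op2, float op3) { return op2 * op1 + op3; }
static float fmadd231(float op1, float op2, float op3) { return op2 * op3 + op1; }

/* Model of the Clang 3.4 output: the result lands in xmm1, so an extra
   copy into xmm0 (the return register) is required. */
static float clang_sequence(float x)
{
    float xmm0 = x;
    float xmm2 = -1.0f;                 /* vbroadcastss xmm2, [.LCPI0_0] (assumed -1) */
    float xmm1 = 2.0f;                  /* vbroadcastss xmm1, [.LCPI0_1] (assumed 2)  */
    xmm1 = fmadd213(xmm1, xmm0, xmm2);  /* vfmadd213ps xmm1, xmm0, xmm2 */
    xmm0 = xmm1;                        /* vmovaps xmm0, xmm1 */
    return xmm0;
}

/* Model of the GCC 4.9 output: the 132 form keeps x in the destination,
   so the result is already in xmm0 and no copy is needed. */
static float gcc_sequence(float x)
{
    float xmm0 = x;
    float xmm1 = -1.0f;                 /* vmovaps xmm1, .LC1 (assumed -1) */
    xmm0 = fmadd132(xmm0, xmm1, 2.0f);  /* vfmadd132ps xmm0, xmm1, .LC0 (assumed 2) */
    return xmm0;
}
```

Both sequences compute x * 2 + (-1) in this model; the only difference is
which register ends up holding the result.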

