[PATCH] D13710: New X86 FMA3*_Int opcodes for scalar FMA intrinsics.

Tue Nov 3 11:48:55 PST 2015

v_klochkov added a comment.

Elena,

Please see the answers to your questions.

Thank you,
Slava


================
Comment at: llvm/lib/Target/X86/X86InstrInfo.cpp:1815
@@ -1796,2 +1814,3 @@
     { X86::VFNMSUBSSr231r,        X86::VFNMSUBSSr231m,        TB_ALIGN_NONE },
+    { X86::VFNMSUBSSr231r_Int,    X86::VFNMSUBSSr231m_Int,    TB_ALIGN_NONE },
     { X86::VFNMSUBSDr231r,        X86::VFNMSUBSDr231m,        TB_ALIGN_NONE },
----------------
delena wrote:
> I don't understand how you can use the 231 form for scalar intrinsic:
> 
> intr_fmadd_ss( a, b, c) may be translated as
> 
> VFMADD213SS a, b, c
> or
> VFMADD132SS a, c, b
> 
> but you can't generate VFMADD231SS because "a" should go first, you are taking the upper part from it.
Very good question. In X86InstrFMA.td I intentionally added a comment noting that problem.
Please see line 215 in that file:

  // The FMA 231 form can be get only by commuting the 1st operand of 213 or 231
  // forms and is possible only after special analysis of all uses of the initial
  // instruction. Such analysis do not exist yet and thus introducing the 231
  // form of FMA*_Int instructions is done using an optimistic assumption that
  // such analysis will be implemented eventually.

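BTW, I noticed a misprint in that comment and I'll fix it: "213 or 231" --> "213 or 132".
If ONLY the lowest element of the FMA213 result is used, then it is possible to commute the 1st operand:
the 231 form takes its upper elements from a different register, so the result differs only in elements that nobody reads.
Such analysis exists and is used in other compilers.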
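
To illustrate the point about the lowest element, here is a rough C model of the two forms. It is only a sketch: the vec4 type, the function names and the EX_* inputs are hypothetical and are not part of the patch, and the model uses a separate multiply and add instead of a fused operation, which does not matter for the small exact values used here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 4-lane model of an XMM register; not an LLVM type. */
typedef struct { float lane[4]; } vec4;

/* Model of the 213 form with a as the destination:
   lane 0 = a*b + c, lanes 1..3 pass through from a. */
static vec4 fmadd213_ss(vec4 a, vec4 b, vec4 c) {
    vec4 r = a;
    r.lane[0] = a.lane[0] * b.lane[0] + c.lane[0];
    return r;
}

/* Model of the 231 form with c as the destination:
   lane 0 = a*b + c, lanes 1..3 pass through from c. */
static vec4 fmadd231_ss(vec4 a, vec4 b, vec4 c) {
    vec4 r = c;
    r.lane[0] = a.lane[0] * b.lane[0] + c.lane[0];
    return r;
}

static bool same_low(vec4 x, vec4 y) { return x.lane[0] == y.lane[0]; }

static bool same_all(vec4 x, vec4 y) {
    for (int i = 0; i < 4; i++)
        if (x.lane[i] != y.lane[i]) return false;
    return true;
}

/* Illustrative inputs only. */
static const vec4 EX_A = {{2.0f, 10.0f, 11.0f, 12.0f}};
static const vec4 EX_B = {{3.0f, 0.0f, 0.0f, 0.0f}};
static const vec4 EX_C = {{4.0f, 20.0f, 21.0f, 22.0f}};

static void check_commute_example(void) {
    vec4 r213 = fmadd213_ss(EX_A, EX_B, EX_C);
    vec4 r231 = fmadd231_ss(EX_A, EX_B, EX_C);
    assert(same_low(r213, r231));   /* lane 0 agrees: 2*3 + 4 */
    assert(!same_all(r213, r231));  /* upper lanes differ: a's vs c's */
}
```

In this model the two forms are interchangeable exactly when every user of the result reads only lane 0, which is the property the missing analysis would have to prove.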

================
Comment at: llvm/test/CodeGen/X86/fma-intrinsics-phi-213-to-231.ll:171
@@ +170,3 @@
+; CHECK-NEXT: retq
+define <4 x float> @fmaddsubps_loop_128(i32 %iter, <4 x float> %a, <4 x float> %b, <4 x float> %c) {
+entry:
----------------
delena wrote:
> The test checks that  FMA intrinsic gives the right form of FMA instruction.
> I don't understand why do you need a loop here. We wrote a lot of FMA intrinsic tests without any loops.
The loop is needed to get the right form of the FMA instruction, i.e. the 231 form is generated when there is a LOOP DEPENDENCY on the ADD path (the result of one iteration is the addend of the next). The test checks that the 231 form is generated for such loops.
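
A rough C model of why the loop matters is below. It is a sketch under simplifying assumptions: it uses scalar floats instead of the packed fmaddsub operation in the test, and the loop_result type and the copy counter are hypothetical bookkeeping for illustration, not something the register allocator computes this way.

```c
#include <assert.h>

/* Hypothetical bookkeeping: final accumulator plus modeled register copies. */
typedef struct { float value; int copies; } loop_result;

/* 213 form: the destination is a multiplicand. Since a is still live in the
   next iteration, it has to be copied before it is overwritten. */
static loop_result loop_with_213(int iter, float a, float b, float c) {
    loop_result r = { c, 0 };
    for (int i = 0; i < iter; i++) {
        float dst = a;            /* modeled copy of a */
        r.copies++;
        dst = dst * b + r.value;  /* dst = dst*b + acc */
        r.value = dst;            /* result becomes the next addend */
    }
    return r;
}

/* 231 form: the destination is the addend, i.e. the accumulator itself,
   so the update happens in place and no copy is modeled. */
static loop_result loop_with_231(int iter, float a, float b, float c) {
    loop_result r = { c, 0 };
    for (int i = 0; i < iter; i++)
        r.value = a * b + r.value;  /* acc = a*b + acc */
    return r;
}

static void check_loop_forms(void) {
    loop_result r213 = loop_with_213(4, 2.0f, 3.0f, 1.0f);
    loop_result r231 = loop_with_231(4, 2.0f, 3.0f, 1.0f);
    assert(r213.value == r231.value);  /* same arithmetic result */
    assert(r213.copies > r231.copies); /* 231 avoids the per-iteration copy */
}
```

Without the loop there is no value fed back into the ADD operand, so nothing in this model would favor 231 over 213.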

================
Comment at: llvm/test/CodeGen/X86/fma-intrinsics-x86.ll:485
@@ +484,3 @@
+; CHECK-FMA-WIN-NEXT: vmovaps (%{{(rcx|rdx)}}), %xmm{{0|1}}
+; CHECK-FMA-WIN-NEXT: vfnmsub213sd (%r8), %xmm1, %xmm0
+;
----------------
delena wrote:
> you check folding vector load into scalar intrinsic.
> On AVX-512 we support folding scalar load to scalar intrinsic., by matching scalar_to_vector(loadf32) pattern in td file
I agree, the check tests memory folding of a vector load into a scalar intrinsic.

Memory folding of a scalar load does not work for test cases like this one (with or without my patch):
  __m128d m = _mm_load_sd(mem);
  __m128d res = _mm_fmadd_sd(a, b, m);
This should be fixed, and I think I know how to do that easily, but I would rather do it in a separate patch.


http://reviews.llvm.org/D13710