[PATCH] D11370: Improved the interface of methods commuting operands, improved X86-FMA3 mem-folding&coalescing.

Vyacheslav Klochkov via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 20 11:25:04 PDT 2015


v_klochkov added a comment.

Hi Quentin,

Thank you for the good idea (commuteInstruction**Impl**()). I hope I understood it correctly.
I have uploaded a new change-set; would you please review the additional changes?

Also, I have answered your question regarding the stability/correctness fix; please see my reply right after your inline comment. In my opinion, correctness takes priority over performance in this particular case.
There are at least two ways to fix the conservative code-gen; both are described below.

Thank you,
Slava


================
Comment at: llvm/lib/Target/X86/X86InstrInfo.cpp:3499
@@ +3498,3 @@
+  if (OpcodeAlts[i].IsScalar && (SrcOpIdx1 == 1 || SrcOpIdx2 == 1))
+    return 0;
+
----------------
qcolombet wrote:
> My understanding is that you are addressing this point here:
> > (2) Fixed a correctness problem caused by commuting 1st and 2nd operands of scalar FMAs generated for intrinsics.
> 
> Most of the time I think we do not care about the high bits of the value (which is what you are fixing here). Therefore, I wonder if we are not being pessimistic about the commutation opportunities.
> I agree we should seek correctness first, but I wonder how often those high bits actually matter… We have had this bug forever and apparently nobody noticed it. 
> 
> Anyway, what is your plan to get us the performance back?
That correctness problem exists for FMAs but not for ADD/MUL operations, which already have separate *_Int opcodes (see below).
Also, FMAs are relatively new instructions, which may explain why nobody has noticed the bug yet.

For example, if you compile the test: 

    #include <immintrin.h>
    double func(double y, double x) {
      return y + x;            /* the upper bits of the returned XMM are don't-care */
    }
    __m128d funcx(__m128d y, __m128d x) {
      return _mm_add_sd(x, y); /* the upper bits must be taken from x */
    }
then you'll see that only one instruction is generated for func(), but two for funcx().
func() can simply ignore the upper bits of the returned XMM register, while funcx() must correctly preserve the upper bits of the returned XMM value.

The difference in the machine IR is that the ADDSDrr opcode is used in func(), while the ADDSDrr_Int opcode is used in funcx().
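
To make the upper-bits semantics concrete, here is a small standalone example (illustrative only, not part of the patch; note that _mm_set_pd takes the high element first):

    #include <immintrin.h>
    #include <stdio.h>

    int main() {
      __m128d a = _mm_set_pd(10.0, 1.0); /* high = 10.0, low = 1.0 */
      __m128d b = _mm_set_pd(20.0, 2.0); /* high = 20.0, low = 2.0 */

      /* _mm_add_sd adds the low elements and copies the HIGH element
         from its FIRST operand, so commuting the operands changes
         the result. */
      double r1[2], r2[2];
      _mm_storeu_pd(r1, _mm_add_sd(a, b)); /* r1 = { 3.0, 10.0 } */
      _mm_storeu_pd(r2, _mm_add_sd(b, a)); /* r2 = { 3.0, 20.0 } */

      printf("r1 = { %f, %f }\n", r1[0], r1[1]);
      printf("r2 = { %f, %f }\n", r2[0], r2[1]);
      return 0;
    }

The same asymmetry applies to the scalar FMA intrinsics, which is why commuting the 1st and 2nd operands of a scalar FMA is unsafe.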

So, one possible solution would be to add *_Int opcodes for the FMA operations, as was done for the ADD and MUL operations, and be more conservative only for the *FMA*_Int opcodes.

Another solution is mentioned in the FIXME comment above, i.e. implement functionality that can tell whether only the lowest element of the result of a scalar FMA is used.

In my opinion, these two solutions do not exclude each other; both should be implemented.
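
To sketch the second solution (a rough illustration of mine, not code from this patch), a helper inside X86InstrInfo.cpp, where the needed headers and X86 opcode names are already available, could conservatively walk the users of the FMA's result register and accept only instructions known to read just the low element:

    // Rough sketch only: returns true if every user of the scalar FMA's
    // result register (operand 0, the def) is known to read only the low
    // element. The opcode list here is illustrative and far from complete.
    static bool onlyLowElementOfResultIsUsed(const MachineInstr &FMA,
                                             const MachineRegisterInfo &MRI) {
      unsigned DefReg = FMA.getOperand(0).getReg();
      for (const MachineInstr &UseMI : MRI.use_instructions(DefReg)) {
        switch (UseMI.getOpcode()) {
        case X86::MOVSDmr: // scalar stores read only the low element
        case X86::MOVSSmr:
          break;
        default:
          return false;    // conservatively assume the upper bits may be read
        }
      }
      return true;
    }

With something like that, commuting the 1st and 2nd operands could stay enabled whenever the helper returns true and fall back to the conservative behavior otherwise.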

Currently we do not have *FMA*_Int opcodes, which is why it is better to be conservative and correct. This patch may make the generated code slightly more conservative in some corner cases, but it also improves code-gen in many other cases, for example those where the 1st or 2nd operand can be swapped with the 3rd operand to enable memory-operand folding.
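
As an illustration of that last point (again my example, not code from the patch): an FMA3 instruction accepts a memory operand only in its last source position, so swapping an operand into that position is what makes load-folding possible, e.g. (compile with -mfma):

    #include <immintrin.h>

    /* The load of *p can be folded into the FMA only if the loaded value
       ends up in the last source position of one of the
       VFMADD132/213/231 forms; commuting the operands lets the backend
       pick such a form instead of emitting a separate load. */
    __m128d fma_fold(__m128d a, const __m128d *p, __m128d b) {
      return _mm_fmadd_sd(a, *p, b);
    }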



http://reviews.llvm.org/D11370




