[PATCH] D18751: [MachineCombiner] Support for floating-point FMA on ARM64

Vyacheslav Klochkov via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 15 13:50:18 PDT 2016


v_klochkov added a comment.

I worked on X86 FMA optimization in another compiler and switched to the LLVM project just recently.

It is unlikely that anything written below can be implemented in this change-set.
It is good that the new interface asks the target for permission to generate or not generate
FMAs in the DAGCombiner, but that does not solve the problems described below.

This should not stop you from adding these changes to LLVM.
Also, do not consider me a reviewer; the primary reviewers should give their opinion.

Here I just want to add some notes regarding the latency-vs-throughput problem on X86,
so that other developers keep them in mind when they add latency-vs-throughput fixes.

My biggest concern about latency-vs-throughput decisions is that they are often made
by looking at just one pattern or DAG, not at the whole loop
(perhaps I am missing something in LLVM).

I provide 4 examples below, all with quite similar code.

  Example1 - shows that FMAs can be very harmful for performance on Haswell.
  Example2 - is similar to Example1; it shows that FMAs can be harmful on Haswell and on newer CPUs like Skylake.
             It also shows that it is often enough to replace only 1 FMA to fix the problem and keep the other FMAs.
  Example3 - shows that the solutions for Example1 and Example2 can easily be wrong.
  Example4 - shows that there is no single solution like "tune for throughput" or "tune for latency";
             the right tuning may differ for different DAGs within one loop.

Ok, let's start...

Fusing MUL+ADD into FMA can easily be inefficient on out-of-order CPUs.
The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if an FMA is generated.

Example1:
!NOTE: Please assume that the C code below only represents the structure of the final ASM code
(i.e. the loop is not unrolled, etc.)

    // LOOP1
    for (unsigned i = 0; i < N; i++) {
      accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
    }
    With FMAs:    the latency of the whole loop on Haswell is N*Latency(FMA) = N*5.
    Without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3.
                  The MUL adds nothing because it is computed out-of-order,
                  i.e. its result is already available by the time the ADD is ready to consume it.

Using FMAs for such a loop may thus cause a (N*5)/(N*3) = 5/3 = 1.67x slowdown
compared to the code without FMAs.
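
For illustration, here is a minimal source-level sketch of the un-fused form (this is not the
generated ASM; the temporary t and the double types are assumptions made only for illustration):

    // LOOP1 without fusion: a[i] * b does not depend on the previous
    // iteration's accu, so the MUL executes out-of-order off the critical
    // path, and only the ADD (latency 3 on Haswell) is loop-carried.
    for (unsigned i = 0; i < N; i++) {
      double t = a[i] * b;  // independent of accu, hidden by OoO execution
      accu = accu + t;      // loop-carried dependency: one ADD per iteration
    }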

On Skylake (CPUs with AVX-512) both versions of LOOP1 (with and without FMA) would
take the same time because the latency of ADD is equal to the latency of FMA there.

Example2:
The same problem can still easily be reproduced on Skylake even though the
latencies of MUL/ADD/FMA are all equal there:

  // LOOP2
  for (unsigned i = 0; i < N; i++) {
    accu = a[i] * b + c[i] * d + accu;
  }

There are at least 3 different sequences for LOOP2:

  S1: 2xFMAs: ACCU = FMA(a[i], b, FMA(c[i], d, ACCU))            // LATENCY = 2xLAT(FMA) = 2*4
  S2: 0xFMAs: ACCU = ADD(MUL(a[i], b), ADD(MUL(c[i], d), ACCU))  // LATENCY = 2xLAT(ADD) = 2*4
  S3: 1xFMA:  ACCU = ADD(ACCU, FMA(a[i], b, MUL(c[i], d)))       // LATENCY = 1xLAT(ADD) = 4

In (S3) the MUL and FMA add nothing to the latency of the whole expression (the loop-carried chain)
because the out-of-order CPU has enough execution units to prepare the results of MUL and FMA
by the time the ADD is ready to consume them.
So (S3) would be about 2 times faster on Skylake and up to 3.3 times faster on Haswell.
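
A source-level sketch of the shape (S3) corresponds to (illustration only; whether the compiler
actually emits this grouping depends on reassociation/fast-math, and the temporary t is assumed):

    // LOOP2 shaped like S3: the sub-expression in t is independent of the
    // previous iteration's accu, so it can be one MUL plus one FMA off the
    // critical path; only the outer ADD is loop-carried.
    for (unsigned i = 0; i < N; i++) {
      double t = a[i] * b + c[i] * d;  // MUL + FMA, not loop-carried
      accu = accu + t;                 // loop-carried: one ADD (4 cycles on Skylake)
    }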

Example3:
It shows that heuristics implemented for Example1 and Example2
may be wrong if applied without whole-loop analysis.

  // LOOP3
  for (unsigned i = 0; i < N; i++) {
    accu1 = a1[i] * b + c1[i] * d + accu1;
    accu2 = a2[i] * b + c2[i] * d + accu2;
    accu3 = a3[i] * b + c3[i] * d + accu3;
    accu4 = a4[i] * b + c4[i] * d + accu4;
    accu5 = a5[i] * b + c5[i] * d + accu5;
    accu6 = a6[i] * b + c6[i] * d + accu6;
    accu7 = a7[i] * b + c7[i] * d + accu7;
    accu8 = a8[i] * b + c8[i] * d + accu8;
  }

This loop must be tuned for throughput because there are many independent DAGs
putting high pressure on the CPU execution units.
The sequence (S1) from Example2 is the best solution for all accumulators in LOOP3:
"ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))".
It works faster because the loop is bounded by throughput.

On Skylake:

  T = approximate throughput of the loop in clock ticks
    = N * 16 operations / 2 execution units = N*8
  L = latency of the loop
    = N * 2*Lat(FMA) = N*2*4 = N*8

The time spent in such a loop is MAX(L,T) = MAX(N*8, N*8) = N*8.

Attempts to replace the FMAs with MUL and ADD may reduce L, but will increase T,
so the time spent in the loop, MAX(L,T), will only get bigger.
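
To make the MAX(L,T) reasoning concrete, here is a hypothetical helper (not an existing LLVM API;
the name loop_cost and its parameters are made up purely for illustration) that estimates the loop
cost the way Examples 3 and 4 do:

  // Hypothetical cost model: over N iterations the loop takes roughly
  // MAX(L, T), where L is bounded by the longest loop-carried chain and
  // T by the number of FP operations divided by the number of FP units.
  static double loop_cost(unsigned N,
                          double carried_chain_latency, /* cycles */
                          unsigned fp_ops_per_iter,
                          unsigned fp_units) {
    double L = N * carried_chain_latency;
    double T = N * (double)fp_ops_per_iter / fp_units;
    return L > T ? L : T;
  }

  // LOOP3 on Skylake with 2 FMAs per accumulator (sequence S1):
  //   loop_cost(N, 2*4, 16, 2) == MAX(N*8, N*8) == N*8
  // Splitting FMAs would lower the latency argument but raise the op count,
  // so MAX(L, T) only grows.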

Example4:
A single loop may need mixed tuning, i.e. for both throughput and latency:

  // LOOP4
  for (unsigned i = 0; i < N; i++) {
    accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
    accu2 = a2[i] * b + accu2; // tune for throughput
    accu3 = a3[i] * b + accu3; // tune for throughput
    accu4 = a4[i] * b + accu4; // tune for throughput
    accu5 = a5[i] * b + accu5; // tune for throughput
    accu6 = a6[i] * b + accu6; // tune for throughput
  }

On Haswell:
If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..6, then

  L = latency of the loop = N * 2*Latency(FMA) = N*2*5 = N*10
  T = throughput = N * 7 operations / 2 execution units = N*3.5
  MAX(L,T) = N*10
   

Using 1xMUL + 1xFMA + 1xADD for ACCU1 will reduce the latency L from N*2*5 to

  L = N*Latency(FMA) = N*5

and will only slightly increase T from N*3.5 to

  T = N * 8 operations / 2 execution units = N*4

As a result, using sequence (S3) for ACCU1 reduces MAX(L,T) from N*10 to MAX(N*5, N*4) = N*5.

Splitting the FMAs in ACCU2..6 as well would only increase MAX(L,T):

  L = N*Latency(ADD) = N*3
  T = N * 13 operations / 2 execution units = N*6.5
  MAX(L,T) = MAX(N*3, N*6.5) = N*6.5


So the best solution in Example4 is to split the FMAs for ACCU1, but keep all the other FMAs.
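
In terms of the hypothetical loop_cost helper sketched after Example3, the three options above
compare on Haswell as:

  // All FMAs:                    loop_cost(N, 2*5,  7, 2) == MAX(N*10, N*3.5) == N*10
  // Split only ACCU1 (S3 shape): loop_cost(N,   5,  8, 2) == MAX(N*5,  N*4)   == N*5
  // Split everything:            loop_cost(N,   3, 13, 2) == MAX(N*3,  N*6.5) == N*6.5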


http://reviews.llvm.org/D18751




