[PATCH] D18751: [MachineCombiner] Support for floating-point FMA on ARM64

Gerolf Hoflehner via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 15 14:53:28 PDT 2016


> On Apr 15, 2016, at 1:50 PM, Vyacheslav Klochkov <vyacheslav.n.klochkov at gmail.com> wrote:
> 
> v_klochkov added a comment.
> 
> I worked on X86-FMA optimization in another compiler and switched to the LLVM project just recently.
> 
> It is unlikely that anything written below can be implemented in this change-set.
> It is good that the new interface asks the target for permission to generate or not to generate
> FMAs in DAGCombiner, but it does not help to solve the problems described below.
You are raising very good points, and thank you for sharing your insights and examples. Note that this change-set does not attempt to solve the problems you outline below. In LLVM, FMAs currently always get combined; in loops this does not change with the patch at hand. But for straight-line code the scheduler can determine a better sequence, which results in net gains overall and does not risk any regression. Throughput, however, is not modeled well; for that, LLVM needs a software-pipelining infrastructure which can model how an out-of-order core overlaps multiple loop iterations - and that will necessarily be uArch-dependent, require good heuristics to bridge the static vs. dynamic gap, etc. My current thinking is that only the resource part of a pipeliner is needed, in a "hybrid" scheme like: a) check FMA sequences on a single iteration, and then, when there is a potentially faster sequence, b) evaluate resource IIs for the competing sequences based on the number of instructions per iteration, the number of instructions that can be in flight, etc. But all that fun possibly comes later and is beyond the modest goal of this patch.
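
To make that hybrid idea a bit more concrete, here is a minimal sketch of the kind of cost comparison I have in mind; the names and the cost model are hypothetical and not part of this patch:

    /* Hypothetical cost model: for one candidate instruction sequence of a
     * loop iteration, the cycles per iteration are approximated as
     * MAX(critical-path latency of the loop-carried chain, resource II),
     * with the resource II taken as #instructions / issue width. */
    struct SeqCost {
      unsigned CritPathLatency; /* cycles along the loop-carried chain */
      unsigned NumInsts;        /* instructions per iteration */
    };

    static double seqCycles(struct SeqCost S, double IssueWidth) {
      double ResII = (double)S.NumInsts / IssueWidth;
      double Lat = (double)S.CritPathLatency;
      return Lat > ResII ? Lat : ResII; /* MAX(L, T) per iteration */
    }

    /* Prefer the alternative sequence only when it is strictly cheaper. */
    static int preferAlternative(struct SeqCost Orig, struct SeqCost Alt,
                                 double IssueWidth) {
      return seqCycles(Alt, IssueWidth) < seqCycles(Orig, IssueWidth);
    }
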

> 
> This should not stop you from adding these changes to LLVM.
> Also, do not consider me as a reviewer, The primary reviewers should give their opinion.
> 
> Here I just wanted to add some notes regarding Latency-vs-Throughput problem in X86
> to let other developers have them in view/attention when they add latency-vs-throughput fixes.
> 
> My biggest concern regarding making Latency-vs-Throughput decisions is that
> such decisions are often made using just one pattern or DAG, it is not based on the whole loop
> analysis (perhaps I am missing something in LLVM).
> 
> I provided 4 examples having quite similar code in them.
> 
>  Example1 - shows that FMAs can be very harmful for performance on Haswell.
>  Example2 - is similar to Example1, shows that FMAs can be harmful on Haswell and newer CPUs like Skylake.
>             It also shows that it is often enough to replace only 1 FMA to fix the problem and leave the other FMAs.
>  Example3 - shows that solutions for Example1 and Example2 can easily be wrong.
>  Example4 - shows that no ONE solution like "tune for throughput" or "tune for latency"
>             exists; tuning may be different for different DAGs in one loop.
> 
> Ok, let's start...
> 
> Fusing MUL+ADD into FMA may easily be inefficient on Out-Of-Order CPUs.
> The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if FMAs are generated.
> 
> Example1:
> !NOTE: Please assume that the C code below only represents the structure of the final ASM code
> (i.e. the loop is not unrolled, etc.)
> 
>    // LOOP1
>    for (unsigned i = 0; i < N; i++) {
>      accu = a[i] * b + accu;// ACCU = FMA(a[i],b,ACCU)
>    }
>    with FMAs: the latency of the whole loop on Haswell is N*Latency(FMA) = N*5
>    without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3
>                  The MUL adds nothing as it is computed out-of-order,
>                  i.e. its result is always available by the time the ADD is ready to consume it.
> 
> 
> Having FMAs for such a loop may result in a (N*5)/(N*3) = (5/3) = 1.67x slowdown
> compared to the code without FMAs.
> 
> On SkyLake (CPUs with AVX512), both versions of LOOP1 (with and without FMA) would
> take the same time because the latency of ADD is equal to the latency of FMA there.
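
For illustration, the two variants of Example1 can be written out explicitly; this is only a sketch of the dependence chains (fma() is the C99 <math.h> function, and the function names are mine, not from the patch):

    #include <math.h>

    /* Fused form: the loop-carried chain is one FMA per iteration
     * (5 cycles on Haswell). */
    double loop1_fma(const double *a, double b, unsigned N) {
      double accu = 0.0;
      for (unsigned i = 0; i < N; i++)
        accu = fma(a[i], b, accu); /* ACCU = FMA(a[i], b, ACCU) */
      return accu;
    }

    /* Split form: the MUL is off the carried chain, so the chain is one
     * ADD per iteration (3 cycles on Haswell). */
    double loop1_split(const double *a, double b, unsigned N) {
      double accu = 0.0;
      for (unsigned i = 0; i < N; i++) {
        double p = a[i] * b;  /* independent of accu */
        accu = accu + p;      /* only the ADD is loop-carried */
      }
      return accu;
    }
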
> 
> Example2:
> The same problem can still be easily reproduced on SkyLake even though the
> latencies of MUL/ADD/FMA are all equal there:
> 
>  // LOOP2
>  for (unsigned i = 0; i < N; i++) {
>    accu = a[i] * b + c[i] * d + accu;
>  }
> 
> There may be at least 3 different sequences for the LOOP2:
> S1: 2xFMAs: ACCU = FMA(a[i],b,FMA(c[i],d,ACCU)) // LATENCY = 2xLAT(FMA) = 2*4
> S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i],b),MUL(c[i],d)),ACCU) // LATENCY = 2xLAT(ADD) = 2*4
> S3: 1xFMA:  ACCU = ADD(ACCU, FMA(a[i],b,MUL(c[i],d))) // LATENCY = 1xLAT(ADD) = 4
> 
> In (S3) the MUL and FMA operations do not add anything to the latency of the whole expression
> because an Out-Of-Order CPU has enough execution units to prepare the results of MUL and FMA
> before the ADD is ready to consume them.
> So (S3) would be about 2 times faster on SkyLake and up to 3.3 times faster on Haswell.
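
Writing sequence (S3) out in C makes the dependence structure clear; again just a sketch with made-up names:

    #include <math.h>

    /* Sequence (S3): one FMA and one MUL feed a single ADD, and only that
     * ADD is on the loop-carried chain, so the carried latency is LAT(ADD). */
    double loop2_s3(const double *a, double b, const double *c, double d,
                    unsigned N) {
      double accu = 0.0;
      for (unsigned i = 0; i < N; i++) {
        double t = fma(a[i], b, c[i] * d); /* FMA(a[i], b, MUL(c[i], d)) */
        accu = accu + t;                   /* ADD(ACCU, t), loop-carried */
      }
      return accu;
    }
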
> 
> Example3:
> It shows that the heuristics that could be implemented for Example1 and Example2
> may be wrong if applied without the whole loop analysis.
> 
>  // LOOP3
>  for (unsigned i = 0; i < N; i++) {
>    accu1 = a1[i] * b + c1[i] * d + accu1;
>    accu2 = a2[i] * b + c2[i] * d + accu2;
>    accu3 = a3[i] * b + c3[i] * d + accu3;
>    accu4 = a4[i] * b + c4[i] * d + accu4;
>    accu5 = a5[i] * b + c5[i] * d + accu5;
>    accu6 = a6[i] * b + c6[i] * d + accu6;
>    accu7 = a7[i] * b + c7[i] * d + accu7;
>    accu8 = a8[i] * b + c8[i] * d + accu8;
>  }
> 
> This loop must be tuned for throughput because there are many independent DAGs
> putting high pressure on the CPU execution units.
> The sequence (S1) from Example2 is the best solution for all accumulators in LOOP3:
> "ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))".
> It works faster because the loop is bounded by throughput.
> 
> On SkyLake:
> 
>  T = approximate throughput of the loop counted in clock-ticks = 
>    N * 16 operations / 2 execution units = N*8
>  L = latency of the loop = 
>    N * 2*Lat(FMA) = N*2*4 = N*8
> 
> The time spent in such loop is MAX(L,T) = MAX(N*8, N*8).
> 
> Attempts to replace FMAs with MUL and ADD may reduce (L), but will increase (T),
> and the time spent in the loop, MAX(L,T), will only get bigger.
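
As a sanity check, plugging the LOOP3 numbers into the hypothetical cost sketch from above reproduces the same result for SkyLake (2 FMA units, LAT(FMA) = 4):

    /* LOOP3 with sequence (S1) on SkyLake: 16 FMAs per iteration and a
     * loop-carried chain of 2 FMAs per accumulator.  Uses the SeqCost /
     * seqCycles sketch from earlier. */
    static double loop3S1CyclesPerIter(void) {
      struct SeqCost Loop3S1 = { /*CritPathLatency=*/ 2 * 4, /*NumInsts=*/ 16 };
      return seqCycles(Loop3S1, /*IssueWidth=*/ 2.0); /* MAX(8, 8) = 8 */
    }
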
> 
> Example4:
> A single loop may need mixed tuning, i.e. for both throughput and latency:
> 
>  // LOOP4
>  for (unsigned i = 0; i < N; i++) {
>    accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
>    accu2 = a2[i] * b + accu2; // tune for throughput
>    accu3 = a3[i] * b + accu3; // tune for throughput
>    accu4 = a4[i] * b + accu4; // tune for throughput
>    accu5 = a5[i] * b + accu5; // tune for throughput
>    accu6 = a6[i] * b + accu6; // tune for throughput
>  }
> 
> On Haswell:
> If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..ACCU6, then
> 
>  Latency of the loop: L = N*2*Latency(FMA) = N*2*5 = N*10
>  Throughput: T = N * 7 operations / 2 execution units = N*3.5
>  MAX(L,T) = N*10
> 
> 
> Using 1xMUL+1xFMA+1xADD for ACCU1 will reduce the latency L from N*2*5 to
> 
>  L = N*Latency(FMA) = N*5
> 
> and will only slightly increase T from N*3.5 to
> 
>  T = N * 8 operations / 2 execution units = N*4
> 
> As a result using sequence (S3) will reduce MAX(L,T) from N*10 to MAX(N*5,N*4) = N*5.
> 
> Splitting the FMAs in ACCU2..ACCU6 will only increase MAX(L,T):
> 
>  L = N*Latency(ADD) = N*3
>  T = N * 13 operations / 2 = N*6.5
>  MAX(L,T) = MAX(N*3, N*6.5) = N*6.5.
> 
> 
> So, the best solution in Example4 is to split 1 FMA in ACCU1, but keep all the other FMAs.
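
Written out as C (again only an illustration, with made-up array names), that best sequence for Example4 looks like:

    #include <math.h>

    /* Example4, best sequence: split only the ACCU1 chain into
     * MUL + FMA + ADD and keep a single FMA for each of ACCU2..ACCU6. */
    void loop4(const double *a1, const double *c1,
               const double *a2, const double *a3, const double *a4,
               const double *a5, const double *a6,
               double b, double d, double accu[6], unsigned N) {
      for (unsigned i = 0; i < N; i++) {
        double t = fma(a1[i], b, c1[i] * d); /* off the carried chain */
        accu[0] = accu[0] + t;               /* ACCU1: carried chain is 1 ADD */
        accu[1] = fma(a2[i], b, accu[1]);    /* ACCU2..ACCU6: 1 FMA each */
        accu[2] = fma(a3[i], b, accu[2]);
        accu[3] = fma(a4[i], b, accu[3]);
        accu[4] = fma(a5[i], b, accu[4]);
        accu[5] = fma(a6[i], b, accu[5]);
      }
    }
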
> 
> 
> http://reviews.llvm.org/D18751
> 
> 
> 


