[PATCH] D25291: [AArch64] Optionally use the reciprocal estimation machinery

Sanjay Patel via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 7 09:29:35 PDT 2016


spatel added a subscriber: v_klochkov.
spatel added a comment.

@jmolloy mentioned the surrounding context for deciding when to use the estimate instructions. I don't think anyone would argue that using an isel attribute to make the decision is anything more than a heuristic.

The alternative is to wait and/or fix up the isel decision in MachineCombiner or some other machine pass. But I think it's worth copying this comment from https://reviews.llvm.org/D18751 / @v_klochkov again - this is at least the 3rd time I've done this. :)

The comment is about FMA, and the examples use x86 cores, but the problem for the compiler is the same: choosing the optimal instruction sequence is hard, and it may not be possible to make that decision without some kind of heuristic.

In https://reviews.llvm.org/D18751#402906, @v_klochkov wrote:

> Here I just wanted to add some notes about the latency-vs-throughput problem on X86
>  so that other developers keep them in mind when they add latency-vs-throughput fixes.
>
> My biggest concern regarding latency-vs-throughput decisions is that
>  such decisions are often made from just one pattern or DAG rather than from an analysis
>  of the whole loop (perhaps I am missing something in LLVM).
>
> I provided 4 examples having quite similar code in them.
>
>   Example1 - shows that FMAs can be very harmful for performance on Haswell.
>   Example2 - is similar to Example1; it shows that FMAs can be harmful on Haswell and on newer CPUs like Skylake.
>              It also shows that it is often enough to replace only 1 FMA to fix the problem and leave the other FMAs alone.
>   Example3 - shows that the solutions for Example1 and Example2 can easily be wrong.
>   Example4 - shows that no single solution like "tune for throughput" or "tune for latency" exists,
>              and that tuning may be different for different DAGs in one loop.
>
> Ok, let's start...
>
> Fusing MUL+ADD into an FMA can easily be inefficient on out-of-order CPUs.
>  The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if an FMA is generated.
>
> Example1:
>  !NOTE: Please assume that the C code below only represents the structure of the final ASM code
>  (i.e. the loop is not unrolled, etc.)
>
>     // LOOP1
>     for (unsigned i = 0; i < N; i++) {
>       accu = a[i] * b + accu;  // ACCU = FMA(a[i],b,ACCU)
>     }
>     with FMAs:    the latency of the whole loop on Haswell is N*Latency(FMA) = N*5
>     without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3;
>                   the MUL adds nothing because it is computed out of order,
>                   i.e. its result is always available by the time the ADD is ready to consume it.
>
> Having FMAs in such a loop may result in a (N*5)/(N*3) = 5/3 = 1.67x slowdown
>  compared to the code without FMAs.
>
> On Skylake (CPUs with AVX512) both versions of LOOP1 (with and without FMA) would
>  take the same time because the latency of ADD is equal to the latency of FMA there.
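>
> For reference, a minimal C sketch of the two loop-body shapes (fma() is the C99
>  routine from <math.h>, used here only to make the fused form explicit; the
>  function and variable names are just illustrative):
>
>     #include <math.h>
>
>     /* Fused form: one FMA on the loop-carried accu chain, so each iteration
>        costs Latency(FMA) = 5 cycles on Haswell. */
>     double loop1_fused(const double *a, double b, double accu, unsigned N) {
>       for (unsigned i = 0; i < N; i++)
>         accu = fma(a[i], b, accu);
>       return accu;
>     }
>
>     /* Unfused form: the MUL does not depend on the previous iteration, so only
>        the ADD (3 cycles on Haswell) sits on the accu chain.  Keeping the MUL and
>        the ADD separate may also require -ffp-contract=off, otherwise the compiler
>        is free to re-fuse them. */
>     double loop1_unfused(const double *a, double b, double accu, unsigned N) {
>       for (unsigned i = 0; i < N; i++) {
>         double t = a[i] * b;   /* computed out of order, off the critical path */
>         accu = accu + t;
>       }
>       return accu;
>     }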
>
> Example2:
>  The same problem can still be easily reproduced on Skylake even though the
>  latencies of MUL/ADD/FMA are all equal there:
>
>   // LOOP2
>   for (unsigned i = 0; i < N; i++) {
>     accu = a[i] * b + c[i] * d + accu;
>   }
>   
>
> There may be at least 3 different sequences for LOOP2:
>  S1: 2xFMAs: ACCU = FMA(a[i],b,FMA(c[i],d,ACCU))            // LATENCY = 2xLAT(FMA) = 2*4
>  S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i],b),MUL(c[i],d)),ACCU)  // LATENCY = 2xLAT(ADD) = 2*4
>  S3: 1xFMA:  ACCU = ADD(ACCU, FMA(a[i],b,MUL(c[i],d)))      // LATENCY = 1xLAT(ADD) = 4
>
> In (S3) the MUL and FMA operations do not add anything to the latency of the whole expression
>  because the out-of-order CPU has enough execution units to have the results of MUL and FMA
>  ready before the ADD needs to consume them.
>  So (S3) would be about 2 times faster on Skylake and up to 3.3 times faster on Haswell.
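>
> The same three shapes written as C loop bodies (a sketch only; fma() is again the
>  <math.h> routine, used to make the fusion choice explicit):
>
>     // S1: both FMAs depend, directly or indirectly, on the previous
>     // iteration's accu.
>     accu = fma(a[i], b, fma(c[i], d, accu));
>
>     // S2: no FMAs: ADD(ADD(MUL(a[i],b), MUL(c[i],d)), accu).
>     accu = (a[i] * b + c[i] * d) + accu;
>
>     // S3: only the outer ADD depends on the previous iteration's accu; the MUL
>     // and the FMA feed it from off that chain.
>     accu = accu + fma(a[i], b, c[i] * d);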
>
> Example3:
>  It shows that the heuristics that could be implemented for Example1 and Example2
>  may be wrong if applied without whole-loop analysis.
>
>   // LOOP3
>   for (unsigned i = 0; i < N; i++) {
>     accu1 = a1[i] * b + c1[i] * d + accu1;
>     accu2 = a2[i] * b + c2[i] * d + accu2;
>     accu3 = a3[i] * b + c3[i] * d + accu3;
>     accu4 = a4[i] * b + c4[i] * d + accu4;
>     accu5 = a5[i] * b + c5[i] * d + accu5;
>     accu6 = a6[i] * b + c6[i] * d + accu6;
>     accu7 = a7[i] * b + c7[i] * d + accu7;
>     accu8 = a8[i] * b + c8[i] * d + accu8;
>   }
>
> This loop must be tuned for throughput because there are many independent DAGs
>  putting high pressure on the CPU execution units.
>  The sequence (S1) from Example2 is the best solution for all accumulators in LOOP3:
>  "ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))".
>  It works faster because the loop is bound by throughput.
>
> On SkyLake:
>
>   T = approximate throughput of the loop counted in clock-ticks = 
>     N * 16 operations / 2 execution units = N*8
>   L = latency of the loop = 
>     N * 2*Lat(FMA) = N*2*4 = N*8
>
> The time spent in such a loop is MAX(L,T) = MAX(N*8, N*8) = N*8.
>
> Attempts to replace the FMAs with MUL and ADD may reduce (L), but will increase (T),
>  so the time spent in the loop, MAX(L,T), will only get bigger.
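>
> As a C sketch, the throughput-tuned body is just the (S1) shape repeated for each
>  accumulator (only the first two are shown; the comment summarizes the op counts
>  from above):
>
>     accu1 = fma(a1[i], b, fma(c1[i], d, accu1));  // 2 FMA ops per accumulator
>     accu2 = fma(a2[i], b, fma(c2[i], d, accu2));
>     // ... accu3 .. accu8 use the same shape: 8 * 2 = 16 FMA ops per iteration,
>     // which the 2 execution units issue in about 8 cycles, matching the N*8
>     // latency of each 2-FMA chain.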
>
> Example4:
>  One loop may need mixed tuning, i.e. tuning for both throughput and latency:
>
>   // LOOP4
>   for (unsigned i = 0; i < N; i++) {
>     accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
>     accu2 = a2[i] * b + accu2; // tune for throughput
>     accu3 = a3[i] * b + accu3; // tune for throughput
>     accu4 = a4[i] * b + accu4; // tune for throughput
>     accu5 = a5[i] * b + accu5; // tune for throughput
>     accu6 = a6[i] * b + accu6; // tune for throughput
>   }
>
> On Haswell:
>  If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..6, then
>
>   Latency of the loop:  L = N*2*Latency(FMA) = N*2*5
>   Throughput:           T = N * 7 operations / 2 execution units = N*3.5
>   MAX(L,T) = N*10
>
> Using 1xMUL + 1xFMA + 1xADD for ACCU1 (i.e. the (S3) shape) will reduce the latency L from N*2*5 to
>
>   L = N*Latency(FMA) = N*5
>
>  and will only slightly increase T from N*3.5 to
>
>   T = N * 8 operations / 2 execution units = N*4
>
> As a result, using sequence (S3) for ACCU1 will reduce MAX(L,T) from N*10 to MAX(N*5,N*4) = N*5.
>
> Splitting the FMAs in ACCU2..6 will only increase MAX(L,T):
>
>   L = N*Latency(ADD) = N*3
>   T = N * 13 operations / 2 = N*6.5
>   MAX(L,T) = MAX(N*3, N*6.5) = N*6.5.
>
>
> So, the best solution in Example4 is to split 1 FMA in ACCU1, but keep all the other FMAs.
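>
> As a C sketch of that final shape (again using fma(); only the structure matters
>  here):
>
>     accu1 = accu1 + fma(a1[i], b, c1[i] * d);  // (S3) shape: only the ADD stays
>                                                // on the accu1 chain
>     accu2 = fma(a2[i], b, accu2);              // keep a single FMA for each of
>     accu3 = fma(a3[i], b, accu3);              // accu2 .. accu6
>     // ... accu4 .. accu6 use the same single-FMA shape ...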



Repository:
  rL LLVM

https://reviews.llvm.org/D25291




