[PATCH] D25291: [AArch64] Optionally use the reciprocal estimation machinery
Sanjay Patel via llvm-commits
llvm-commits at lists.llvm.org
Fri Oct 7 09:29:35 PDT 2016
spatel added a subscriber: v_klochkov.
spatel added a comment.
@jmolloy brought up the surrounding context that matters when deciding whether to use the estimate instructions. I don't think anyone would argue that using an isel attribute to make the decision is anything more than a heuristic.
The alternative is to wait and/or fix up the isel decision in MachineCombiner or some other machine pass. But I think it's worth copying this comment from https://reviews.llvm.org/D18751 / @v_klochkov again - this is at least the 3rd time I've done this. :)
The comment is about FMA, and the examples use x86 cores, but the problem for the compiler is the same: choosing the optimal instructions is a hard problem, and it may not be possible to make this decision without some kind of heuristic.
In https://reviews.llvm.org/D18751#402906, @v_klochkov wrote:
> Here I just wanted to add some notes regarding the latency-vs-throughput problem on X86,
> so that other developers keep them in mind when they add latency-vs-throughput fixes.
>
> My biggest concern regarding latency-vs-throughput decisions is that
> such decisions are often made from just one pattern or DAG; they are not based on
> an analysis of the whole loop (perhaps I am missing something in LLVM).
>
> I provide 4 examples with quite similar code in them.
>
> Example1 - shows that FMAs can be very harmful for performance on Haswell.
> Example2 - is similar to Example1; it shows that FMAs can be harmful on Haswell and newer CPUs like Skylake.
> It also shows that it is often enough to replace only 1 FMA to fix the problem and leave the other FMAs.
> Example3 - shows that the solutions for Example1 and Example2 can easily be wrong.
> Example4 - shows that no single solution like "tune for throughput" or "tune for latency"
> exists, and tuning may be different for different DAGs in one loop.
>
>
> Ok, let's start...
>
> Fusing MUL+ADD into FMA can easily be inefficient on out-of-order CPUs.
> The following trivial loop runs about 60-70% slower on Haswell (-march=core-avx2) if FMA is generated.
>
> Example1:
> !NOTE: Please assume that the C code below only represents the structure of the final ASM code
> (i.e. the loop is not unrolled, etc.)
>
> // LOOP1
> for (unsigned i = 0; i < N; i++) {
>   accu = a[i] * b + accu; // ACCU = FMA(a[i], b, ACCU)
> }
> With FMAs: the latency of the whole loop on Haswell is N*Latency(FMA) = N*5.
> Without FMAs: the latency of the whole loop on Haswell is N*Latency(ADD) = N*3.
> The MUL adds nothing because it is computed out of order,
> i.e. its result is already available by the time the ADD is ready to consume it.
>
>
> Using FMAs for such a loop may result in a (N*5)/(N*3) = 5/3 = 1.67x slowdown
> compared to the code without FMAs.
>
> On SkyLake (CPUs with AVX512), both versions of LOOP1 (with and without FMA) would
> take the same time because the latency of ADD is equal to the latency of FMA there.
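>
> A minimal C sketch of the two shapes of LOOP1 (illustrative only: fma() from <math.h>
> stands in for the hardware FMA, and the array/scalar names are assumptions):
>
>   #include <math.h>
>
>   // With FMA: every iteration's FMA sits on the loop-carried chain,
>   // so the loop latency is N * Lat(FMA) (5 cycles on Haswell).
>   double loop1_fma(const double *a, double b, unsigned N) {
>     double accu = 0.0;
>     for (unsigned i = 0; i < N; i++)
>       accu = fma(a[i], b, accu); // ACCU = FMA(a[i], b, ACCU)
>     return accu;
>   }
>
>   // Without FMA: only the ADD (3 cycles on Haswell) is on the carried chain;
>   // the MUL executes out of order, off the critical path.
>   // (Assumes the compiler does not re-contract this into an FMA,
>   // e.g. compiled with -ffp-contract=off.)
>   double loop1_mul_add(const double *a, double b, unsigned N) {
>     double accu = 0.0;
>     for (unsigned i = 0; i < N; i++)
>       accu = accu + a[i] * b;
>     return accu;
>   }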
>
> Example2:
> The same problem can still easily be reproduced on SkyLake even though the
> latencies of MUL/ADD/FMA are all equal there:
>
> // LOOP2
> for (unsigned i = 0; i < N; i++) {
>   accu = a[i] * b + c[i] * d + accu;
> }
>
>
> There are at least 3 different sequences for LOOP2:
>   S1: 2xFMAs: ACCU = FMA(a[i], b, FMA(c[i], d, ACCU))           // LATENCY = 2xLAT(FMA) = 2*4
>   S2: 0xFMAs: ACCU = ADD(ADD(MUL(a[i], b), MUL(c[i], d)), ACCU) // LATENCY = 2xLAT(ADD) = 2*4
>   S3: 1xFMA:  ACCU = ADD(ACCU, FMA(a[i], b, MUL(c[i], d)))      // LATENCY = 1xLAT(ADD) = 4
>
> In (S3) the MUL and FMA operations add nothing to the latency of the whole expression
> because an out-of-order CPU has enough execution units to have the results of the MUL and FMA
> ready before the ADD needs to consume them.
> So (S3) would be about 2 times faster on SkyLake and up to 3.3 times faster on Haswell.
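>
> The three sequences, written as a minimal C sketch (illustrative; fma() from <math.h>
> is assumed to lower to the hardware FMA instruction):
>
>   #include <math.h>
>
>   double loop2(const double *a, double b, const double *c, double d, unsigned N) {
>     double accu = 0.0;
>     for (unsigned i = 0; i < N; i++) {
>       // S1: accu = fma(a[i], b, fma(c[i], d, accu)); // both FMAs on the carried chain
>       // S2: accu = (a[i] * b + c[i] * d) + accu;     // no FMAs
>       accu = accu + fma(a[i], b, c[i] * d);           // S3: only the ADD on the carried chain
>     }
>     return accu;
>   }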
>
> Example3:
> It shows that the heuristics that could be implemented for Example1 and Example2
> may be wrong if applied without whole-loop analysis.
>
> // LOOP3
> for (unsigned i = 0; i < N; i++) {
>   accu1 = a1[i] * b + c1[i] * d + accu1;
>   accu2 = a2[i] * b + c2[i] * d + accu2;
>   accu3 = a3[i] * b + c3[i] * d + accu3;
>   accu4 = a4[i] * b + c4[i] * d + accu4;
>   accu5 = a5[i] * b + c5[i] * d + accu5;
>   accu6 = a6[i] * b + c6[i] * d + accu6;
>   accu7 = a7[i] * b + c7[i] * d + accu7;
>   accu8 = a8[i] * b + c8[i] * d + accu8;
> }
>
> This loop must be tuned for throughput because there are many independent DAGs
> putting high pressure on the CPU execution units.
> The sequence (S1) from Example2 is the best solution for all accumulators in LOOP3:
>   ACCUi = FMA(ai[i], b, FMA(ci[i], d, ACCUi))
> It works faster because the loop is bounded by throughput.
>
> On SkyLake:
>
>   T = approximate throughput of the loop, counted in clock ticks
>     = N * 16 operations / 2 execution units = N*8
>   L = latency of the loop
>     = N * 2*Lat(FMA) = N*2*4 = N*8
>
> The time spent in such a loop is MAX(L,T) = MAX(N*8, N*8) = N*8.
>
> Attempts to replace the FMAs with MUL and ADD may reduce (L) but will increase (T),
> so the time spent in the loop, MAX(L,T), will only get bigger.
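>
> The MAX(L,T) estimate above can be expressed as a toy model (a sketch; the function and
> its parameters are assumptions made for illustration, not an LLVM API):
>
>   // Estimated cycles for a loop bounded either by its longest loop-carried
>   // chain (latency) or by execution-unit pressure (throughput).
>   static double loop_cycles(double iters, double ops_per_iter, double pipes,
>                             double chain_depth, double lat) {
>     double T = iters * ops_per_iter / pipes; // throughput bound
>     double L = iters * chain_depth * lat;    // latency bound
>     return L > T ? L : T;                    // time ~= MAX(L,T)
>   }
>
>   // LOOP3 with S1 on SkyLake: loop_cycles(N, 16, 2, 2, 4) = MAX(N*8, N*8) = N*8.
>   // Splitting any FMA raises ops_per_iter, so T grows and MAX(L,T) gets worse.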
>
> Example4:
> Tuning may be mixed, i.e. for both throughput and latency within one loop:
>
> // LOOP4
> for (unsigned i = 0; i < N; i++) {
>   accu1 = a1[i] * b + c1[i] * d + accu1; // tune for latency
>   accu2 = a2[i] * b + accu2;             // tune for throughput
>   accu3 = a3[i] * b + accu3;             // tune for throughput
>   accu4 = a4[i] * b + accu4;             // tune for throughput
>   accu5 = a5[i] * b + accu5;             // tune for throughput
>   accu6 = a6[i] * b + accu6;             // tune for throughput
> }
>
> On Haswell:
> If we generate 2 FMAs for ACCU1 and 1 FMA for each of ACCU2..ACCU6, then
>
>   Latency of the loop: L = N*2*Latency(FMA) = N*2*5 = N*10
>   Throughput: T = N * 7 operations / 2 execution units = N*3.5
>   MAX(L,T) = N*10
>
>
> Using 1xMUL+1xFMA+1xADD for ACCU1 reduces the latency L from N*2*5 to
>
>   L = N*Latency(FMA) = N*5
>
> (the single-FMA chains of ACCU2..ACCU6 now dominate the latency), and only slightly
> increases T from N*3.5 to
>
>   T = N * 8 operations / 2 execution units = N*4
>
> As a result, using sequence (S3) for ACCU1 reduces MAX(L,T) from N*10 to MAX(N*5, N*4) = N*5.
>
>
>
> Splitting the FMAs in ACCU2..ACCU6 would only increase MAX(L,T):
>
>   L = N*Latency(ADD) = N*3
>   T = N * 13 operations / 2 execution units = N*6.5
>   MAX(L,T) = MAX(N*3, N*6.5) = N*6.5
>
>
> So, the best solution in Example4 is to split 1 FMA in ACCU1, but keep all the other FMAs.
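>
> As a minimal C sketch of that mixed tuning (illustrative; only ACCU1 and ACCU2 are shown,
> the names are assumptions, and ACCU3..6 would follow the ACCU2 pattern):
>
>   #include <math.h>
>
>   void loop4(const double *a1, const double *c1, const double *a2,
>              double b, double d, unsigned N, double *out1, double *out2) {
>     double accu1 = 0.0, accu2 = 0.0;
>     for (unsigned i = 0; i < N; i++) {
>       // Latency-tuned (S3): only the ADD sits on accu1's carried chain;
>       // the MUL and the FMA are computed off the critical path.
>       accu1 = accu1 + fma(a1[i], b, c1[i] * d);
>       // Throughput-tuned: keep the single FMA on accu2's carried chain.
>       accu2 = fma(a2[i], b, accu2);
>     }
>     *out1 = accu1;
>     *out2 = accu2;
>   }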
> `
Repository:
rL LLVM
https://reviews.llvm.org/D25291