[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication

Tue May 3 23:18:47 PDT 2016

On 05/02/2016 08:07 PM, Roman Gareev wrote:
> Hi Tobias,

Hi Roman,

thank you for the updated.

> according to [1], we can expect 90% of the turbo boost peak of the
> processor with a C version of Matrix-Matrix Multiplication that is
> similar to the one presented in [1]. In case of Intel Core i7-3820
> SandyBridge, the theoretical maximal performance of the machine is
> 28.8 gflops and hence the expected number is 25,92 gflops.

I see.

> However, in case of, for example, n = m = 1056 and k = 1024 a code
> based on BLIS framework takes 0.088919 seconds and hence 25,68 gflops.

So you are saying BLIS itself only reaches 89% of peak? I would have
expected for them to reach slightly more, but your measurements in
your GSoC draft seem to give the same numbers, also for MKL. So let's
assume this is true and not worry about this for now.

> I’m not sure whether a C implementation, which similar to one the
> presented in [1], can outperform a code based on BLIS framework. What
> do you think about it?  What performance should we necessarily reach?

The first goal I would set is to get within 10-20% percent of what we
see with BLIS, rather then being an order of magnitude off. The
precise performance is probably not yet so relevant. What is important
is to understand what transformations are needed and to ensure the
most impactful building blocks are available through Polly.

> In case of, for example,  n = m = 1056 and k = 1024, I’ve managed to
> reduce the execution time from 0.107210 to 0,097794 seconds (a
> corresponding implementation can be found attached) and hence get
> 23,35 gflops.

This is 81% of peak. Even though I would hope we get even closer, I
think this is already a great first step. Just for comparison it would
be nice if you could list GFLOP performance we see with clang -O3,
Polly and icc -O3 -ffast-math

> To do so, the following things were implemented:
> 
> 1. An algorithm presented in [1] and, probably, in [2] assumes that
> all matrices are stored in column-major order. That’s why 'i' an 'j'
> loops were interchanged and parameters kc, nc, mc were recomputed to
> correspond to row-major order.

OK.

> 2. The algorithm presented in [1] differs from the one presented in
> [2]. It performs only one level of tiling of a 'j' loop and gets
> micro-panels of width nr. Furthermore, it always uses a static buffer
> with the sizes 1000 x kc. Although it helps to avoid repacking of the
> Ac for every iteration of the Loop 1 from [2], it limits the size of
> the ’n’ dimension to 1000. I’ve also implemented it in the attached
> program (however, it uses a buffer with the sizes 1056 x 1024).

Did you now implement [1] or [2]? I assume it's [1]?

> 3. I’ve found out that along with kernels written in assembly a
> repository with BLIS framework also contains their C implementations
> [3] (Sorry, I managed to do it only after they reorganized their
> repository. I haven’t determined the purpose of C implementations
> yet). The implementation, which targets Sandy Bridge architecture [4],
> contains inline assembly with 'prefetch' instructions. Using it in the
> attached program helps to reduce the execution time from 0.12 to
> 0,097794 seconds.

Nice.

> 
> I have one more question. Could we create additional threads to
> perform packing transformation?

Why would we do this? I would believe that multi-threading at the
outermost loops is likely more efficient?

> If I’m not mistaken, BLIS framework applies multithreading in its
> implementation [5]. That’s why we would probably get more than
> 0.088919 seconds mentioned above, if the multithreading were disabled
> (I’ve been using export OMP_THREAD_LIMIT=1 to limit the number of OMP
> threads. However, I haven’t found a way to avoid usual
> multithreading).

So you are saying BLIS is using multithreading and still only reaches
90% of a single core? This sounds exceptionally inefficient. These
numbers look a lot more like the single threaded performance of BLIS. I
would suggest you cross check if this is really single-threaded or
multi-threaded performance.

Overall, these results already suggest that we can get with reasonable
effort into the range or vendor BLAS libraries.

Do I understand correctly that the only parts that can not yet be
implemented with a Polly schedule transformation are prefetching as
well as the on-the-fly transposition of the matrices?

Best,
Tobias