[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication

Roman Gareev via llvm-dev llvm-dev at lists.llvm.org
Fri May 6 02:57:45 PDT 2016


> So you are saying BLIS itself only reaches 89% of peak? I would have
> expected for them to reach slightly more, but your measurements in
> your GSoC draft seem to give the same numbers, also for MKL. So let's
> assume this is true and not worry about this for now.

OK.

>> I’m not sure whether a C implementation similar to the one
>> presented in [1] can outperform code based on the BLIS framework.
>> What do you think about it? What performance do we necessarily need
>> to reach?
>
> The first goal I would set is to get within 10-20% of what we
> see with BLIS, rather than being an order of magnitude off. The
> precise performance is probably not yet so relevant. What is important
> is to understand what transformations are needed and to ensure the
> most impactful building blocks are available through Polly.

Yes, I think that it's important. Thank you for the advice!

>> In the case of, for example, n = m = 1056 and k = 1024, I’ve managed
>> to reduce the execution time from 0.107210 to 0.097794 seconds (the
>> corresponding implementation can be found attached) and hence reach
>> 23.35 GFLOPS.
>
> This is 81% of peak. Even though I would hope we get even closer, I
> think this is already a great first step. Just for comparison it would
> be nice if you could list GFLOP performance we see with clang -O3,
> Polly and icc -O3 -ffast-math

In the case of n = m = 1056 and k = 1024, I get the following numbers
(in GFLOPS):

1. clang -O3                 1.09627398514907
2. Polly vect=stripmine      7.614133825873
3. Polly vect=none           3.04573476564844
4. OpenBLAS                 26.39588687139538
5. BLIS                     25.68403297383012
6. Intel MKL                26.75772431488793
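
(For reference, I compute GFLOPS as 2 * m * n * k / time, counting one
multiplication and one addition per element of each inner product; for
n = m = 1056, k = 1024 and the 0.097794 seconds mentioned above, this
gives 2 * 1056 * 1056 * 1024 / 0.097794 ≈ 23.35 GFLOPS.)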

>> To do so, the following things were implemented:
>>
>> 1. The algorithm presented in [1] and, probably, in [2] assumes that
>> all matrices are stored in column-major order. That’s why the 'i'
>> and 'j' loops were interchanged and the parameters kc, nc, mc were
>> recomputed to correspond to row-major order.
>
> OK.
>
>> 2. The algorithm presented in [1] differs from the one presented in
>> [2]. It performs only one level of tiling of the 'j' loop and gets
>> micro-panels of width nr. Furthermore, it always uses a static
>> buffer of size 1000 x kc. Although this helps to avoid repacking Ac
>> for every iteration of Loop 1 from [2], it limits the size of the
>> 'n' dimension to 1000. I’ve also implemented this in the attached
>> program (however, it uses a buffer of size 1056 x 1024).
>
> Did you now implement [1] or [2]? I assume it's [1]?

Yes, I implemented [1]. It can be found in the attached file.
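
In case it is useful for discussing which building blocks Polly needs,
below is a simplified, compilable sketch of the blocked-and-packed loop
structure in the spirit of [1] and [2], for row-major C += A * B. The
blocking constants, the naive packing routines, and the micro-kernel
are placeholders, not the tuned versions from the attached file, and
edge cases are avoided by assuming the block sizes divide the problem
sizes evenly (which they do for m = n = 1056, k = 1024):

/* Simplified sketch of the blocked-and-packed structure for row-major
 * C += A * B, where A is m x k and B is k x n.  MC/NC/KC/MR/NR and the
 * naive helpers below are placeholders; m, n, k are assumed to be
 * multiples of MC, NC, KC (and MC, NC multiples of MR, NR). */
enum { MC = 96, NC = 1056, KC = 1024, MR = 4, NR = 8 };

static double Ac[MC * KC]; /* packed mc x kc block of A */
static double Bc[KC * NC]; /* packed kc x nc panel of B */

/* Pack a block of A into contiguous row micro-panels of height MR
 * (this is the on-the-fly transposition of the data layout). */
static void pack_A(const double *A, int lda, double *buf) {
  for (int i = 0; i < MC; ++i)
    for (int p = 0; p < KC; ++p)
      buf[(i / MR) * (MR * KC) + p * MR + i % MR] = A[i * lda + p];
}

/* Pack a panel of B into contiguous column micro-panels of width NR. */
static void pack_B(const double *B, int ldb, double *buf) {
  for (int p = 0; p < KC; ++p)
    for (int j = 0; j < NC; ++j)
      buf[(j / NR) * (KC * NR) + p * NR + j % NR] = B[p * ldb + j];
}

/* Naive MR x NR micro-kernel; the real one is vectorized and uses
 * prefetching. */
static void micro_kernel(const double *a, const double *b,
                         double *C, int ldc) {
  for (int p = 0; p < KC; ++p)
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        C[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
}

void gemm(int m, int n, int k, const double *A, const double *B,
          double *C) {
  for (int jc = 0; jc < n; jc += NC)          /* Loop 5: n into nc */
    for (int pc = 0; pc < k; pc += KC) {      /* Loop 4: k into kc */
      pack_B(&B[pc * n + jc], n, Bc);
      for (int ic = 0; ic < m; ic += MC) {    /* Loop 3: m into mc */
        pack_A(&A[ic * k + pc], k, Ac);
        for (int jr = 0; jr < NC; jr += NR)   /* Loop 2: panels of Bc */
          for (int ir = 0; ir < MC; ir += MR) /* Loop 1: panels of Ac */
            micro_kernel(&Ac[(ir / MR) * (MR * KC)],
                         &Bc[(jr / NR) * (KC * NR)],
                         &C[(ic + ir) * n + jc + jr], n);
      }
    }
}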

>> 3. I’ve found out that, along with kernels written in assembly, the
>> BLIS framework repository also contains their C implementations
>> [3] (sorry, I only managed to notice this after they reorganized the
>> repository; I haven’t determined the purpose of the C
>> implementations yet). The implementation that targets the Sandy
>> Bridge architecture [4] contains inline assembly with 'prefetch'
>> instructions. Using it in the attached program helps to reduce the
>> execution time from 0.12 to 0.097794 seconds.
>
> Nice.
>
>>
>> I have one more question. Could we create additional threads to
>> perform packing transformation?
>
> Why would we do this? I would believe that multi-threading at the
> outermost loops is likely more efficient?

Yes, it's probably more efficient. Sorry, I thought that a
multi-threaded version of the packing transformation was used by
default in, for example, the BLIS framework. However, we could
probably use one if we implement a multi-threaded version of the
matrix multiplication.
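
For what it's worth, if we do implement a multi-threaded version
later, one simple option would be to parallelize Loop 2 of the sketch
earlier in this mail, since its iterations only read the shared packed
buffers and write disjoint columns of C. This is just an illustration
compiled with -fopenmp, not what BLIS actually does in every
configuration:

/* Replacing Loop 2 of the earlier sketch; iterations read Ac/Bc and
 * write disjoint parts of C, so they are independent. */
#pragma omp parallel for
for (int jr = 0; jr < NC; jr += NR)
  for (int ir = 0; ir < MC; ir += MR)
    micro_kernel(&Ac[(ir / MR) * (MR * KC)],
                 &Bc[(jr / NR) * (KC * NR)],
                 &C[(ic + ir) * n + jc + jr], n);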

>> If I’m not mistaken, the BLIS framework applies multithreading in
>> its implementation [5]. That’s why we would probably get more than
>> the 0.088919 seconds mentioned above if the multithreading were
>> disabled (I’ve been using export OMP_THREAD_LIMIT=1 to limit the
>> number of OpenMP threads; however, I haven’t found a way to avoid
>> its remaining, non-OpenMP multithreading).
>
> So you are saying BLIS is using multithreading and still only reaches
> 90% of a single core? This sounds exceptionally inefficient. These
> numbers look a lot more like the single-threaded performance of BLIS. I
> would suggest you cross-check whether this is really single-threaded or
> multi-threaded performance.

Yes, you’re right. This is single-threaded performance, because BLIS
disables multithreading by default (see
https://github.com/flame/blis/wiki/Multithreading). I confused it with
Intel MKL. Thanks for the advice.
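
(As far as I can tell from that wiki page, when BLIS is built with
multithreading enabled, the thread count is controlled through
BLIS-specific environment variables described there rather than
through OMP_THREAD_LIMIT, so those would be the knobs to check if we
ever need to force single-threaded runs explicitly.)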

> Overall, these results already suggest that we can get within range
> of vendor BLAS libraries with reasonable effort.
>
> Do I understand correctly that the only parts that cannot yet be
> implemented with a Polly schedule transformation are prefetching as
> well as the on-the-fly transposition of the matrices?

Yes, I think you're right.
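
To spell out what I mean: the on-the-fly transposition corresponds to
the pack_A/pack_B routines of the sketch earlier in this mail, and the
prefetching looks roughly like the following. This is a sketch using
intrinsics instead of the inline assembly of the Sandy Bridge kernel
[4], reusing KC, MR, NR and the packed-panel layout from the earlier
sketch; the prefetch distance of 8 k-iterations is a made-up value,
not the tuned one:

#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_T0 */

/* Micro-kernel from the earlier sketch, extended with software
 * prefetching of the packed panels a few k-iterations ahead.  Since
 * prefetch is only a hint, requesting addresses slightly past the end
 * of a buffer does not fault. */
static void micro_kernel_prefetch(const double *a, const double *b,
                                  double *C, int ldc) {
  for (int p = 0; p < KC; ++p) {
    _mm_prefetch((const char *)&a[(p + 8) * MR], _MM_HINT_T0);
    _mm_prefetch((const char *)&b[(p + 8) * NR], _MM_HINT_T0);
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        C[i * ldc + j] += a[p * MR + i] * b[p * NR + j];
  }
}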

-- 
                                    Cheers, Roman Gareev.

