[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication

Roman Gareev via llvm-dev llvm-dev at lists.llvm.org
Mon May 2 11:07:55 PDT 2016

Hi Tobias,

according to [1], we can expect about 90% of the turbo boost peak of the
processor from a C version of Matrix-Matrix Multiplication similar to
the one presented in [1]. In the case of an Intel Core i7-3820
Sandy Bridge, the theoretical maximum performance of the machine is
28.8 gflops, and hence the expected number is 25.92 gflops.

However, in the case of, for example, n = m = 1056 and k = 1024, a code
based on the BLIS framework takes 0.088919 seconds, i.e. 25.68 gflops.
I’m not sure whether a C implementation similar to the one presented
in [1] can outperform code based on the BLIS framework. What do you
think about it? What performance should we necessarily reach?
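For reference, the gflops figures above follow from the standard 2·m·n·k flop count of GEMM (one multiply and one add per inner-product term) divided by the measured wall time; a minimal helper (the function name is mine, not from the attached program):

```c
/* Sustained rate in gflops for a GEMM of sizes m x k times k x n,
 * given the measured wall time in seconds: 2*m*n*k flops total. */
static double gemm_gflops(long m, long n, long k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

With m = n = 1056, k = 1024 this gives about 25.68 gflops for 0.088919 seconds and about 23.35 gflops for 0.097794 seconds, matching the numbers above.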

In the case of, for example, n = m = 1056 and k = 1024, I’ve managed to
reduce the execution time from 0.107210 to 0.097794 seconds (a
corresponding implementation can be found attached) and hence get
23.35 gflops. To do so, the following things were implemented:

1. The algorithm presented in [1] and, probably, in [2] assumes that
all matrices are stored in column-major order. That’s why the 'i' and
'j' loops were interchanged and the parameters kc, nc, mc were
recomputed to correspond to row-major order.
2. The algorithm presented in [1] differs from the one presented in
[2]. It performs only one level of tiling of the 'j' loop and gets
micro-panels of width nr. Furthermore, it always uses a static buffer
of size 1000 x kc. Although this helps to avoid repacking of Ac for
every iteration of Loop 1 from [2], it limits the size of the 'n'
dimension to 1000. I’ve also implemented it in the attached program
(however, it uses a buffer of size 1056 x 1024).
3. I’ve found out that, along with kernels written in assembly, the
repository with the BLIS framework also contains their C
implementations [3] (sorry, I managed to notice this only after they
reorganized their repository; I haven’t determined the purpose of the
C implementations yet). The implementation that targets the Sandy
Bridge architecture [4] contains inline assembly with 'prefetch'
instructions. Using it in the attached program helps to reduce the
execution time from 0.12 to 0.097794 seconds.
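Regarding item 1, the layout argument can be illustrated with a naive reference loop nest; this is only a sketch, not the attached implementation:

```c
/* Naive reference GEMM, C = C + A*B, all matrices row-major.
 * With row-major storage the unit-stride index is the column index,
 * so keeping 'j' innermost walks B and C contiguously -- the opposite
 * of the column-major layout assumed in [1], hence the i/j interchange. */
static void gemm_row_major(int m, int n, int k,
                           const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; ++i)          /* rows of C */
        for (int p = 0; p < k; ++p)      /* shared dimension */
            for (int j = 0; j < n; ++j)  /* unit stride in B and C */
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```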
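Regarding item 2, a hypothetical sketch of packing a block of B into a fixed-size static buffer in micro-panels of width nr (the sizes, names, and layout are assumptions for illustration, not the attached program):

```c
#define KC 1024
#define NR 4

/* Static buffer holding one packed block of B: up to KC rows by 1056
 * columns, analogous to the fixed-size buffer described above. */
static double Bc[KC * 1056];

/* Regroup the nc columns of a kc x nc block of B (row-major, leading
 * dimension ldb) into micro-panels of width NR, so that the micro-kernel
 * reads each panel with unit stride: element order is
 * (panel, row, column-within-panel). */
static void pack_B_panel(int kc, int nc, const double *B, int ldb)
{
    for (int j = 0; j < nc; j += NR)            /* one micro-panel */
        for (int p = 0; p < kc; ++p)            /* its kc rows */
            for (int jj = 0; jj < NR && j + jj < nc; ++jj)
                Bc[j * kc + p * NR + jj] = B[p * ldb + j + jj];
}
```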
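Regarding item 3, the effect of the kernel's 'prefetch' instructions can be approximated in portable C with GCC/Clang's __builtin_prefetch; the 4x4 micro-kernel shape and the prefetch distance below are assumptions for illustration, not the BLIS kernel itself:

```c
#define PF_DIST 8  /* prefetch distance in iterations; a tuning assumption */

/* 4x4 micro-kernel over packed micro-panels a (of A) and b (of B):
 * at step p it consumes a[4p..4p+3] and b[4p..4p+3], and prefetches
 * the elements PF_DIST iterations ahead so they are in cache when
 * needed (prefetching past the end of the panels is harmless). */
static void micro_kernel_4x4(int kc, const double *a, const double *b,
                             double *C, int ldc)
{
    double c[4][4] = {{0}};
    for (int p = 0; p < kc; ++p) {
        __builtin_prefetch(&a[4 * (p + PF_DIST)], 0, 3);
        __builtin_prefetch(&b[4 * (p + PF_DIST)], 0, 3);
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                c[i][j] += a[4 * p + i] * b[4 * p + j];
    }
    for (int i = 0; i < 4; ++i)        /* accumulate into C */
        for (int j = 0; j < 4; ++j)
            C[i * ldc + j] += c[i][j];
}
```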

I have one more question. Could we create additional threads to
perform packing transformation?
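Since each micro-panel is packed independently, the packing loop is embarrassingly parallel; a hypothetical OpenMP sketch of what such extra threads might do (names and layout are mine; without -fopenmp the pragma is ignored and the code runs serially):

```c
#define NR 4

/* Pack a kc x nc block of B (row-major, leading dimension ldb) into Bc
 * in NR-wide micro-panels, distributing the panels across OpenMP
 * threads: no two iterations of the 'j' loop touch the same panel. */
static void pack_B_parallel(int kc, int nc, const double *B, int ldb,
                            double *Bc)
{
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < nc; j += NR)
        for (int p = 0; p < kc; ++p)
            for (int jj = 0; jj < NR && j + jj < nc; ++jj)
                Bc[j * kc + p * NR + jj] = B[p * ldb + j + jj];
}
```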

If I’m not mistaken, the BLIS framework applies multithreading in its
implementation [5]. That’s why we would probably get more than the
0.088919 seconds mentioned above if the multithreading were disabled
(I’ve been using export OMP_THREAD_LIMIT=1 to limit the number of OMP
threads. However, I haven’t found a way to avoid usual


[1] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[2] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[3] - https://github.com/flame/blis/tree/master/kernels/x86_64/sandybridge/3
[4] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
[5] - https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_blk_var3f.c

                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD.c
Type: text/x-csrc
Size: 5715 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160502/ae26ee5e/attachment.c>
