[llvm-dev] Determination of statements that contain only matrix multiplication

Roman Gareev via llvm-dev llvm-dev at lists.llvm.org
Fri May 20 06:05:22 PDT 2016


2016-05-19 21:45 GMT+05:00 4lbert C0hen <4lbert.h.c0hen at gmail.com>:
> One short note. I would advise against spending time on prefetching for x86.
> Recent hardware prefetchers are amazingly good at strided accesses in
> single-threaded code. Caution: this is not based on objective/published
> data, but on personal experience.
>
> There are open challenges in multiprocessor prefetching, even for regularly
> strided data, but these are probably too ambitious to be tackled
> effectively in the time frame of a SoC. There are lots of papers on this
> however.
>
> Now, if you are targeting lower power processors, including most ARM v7a/v8
> implementations, prefetching may be much more important. There are much
> fewer publications on BLAS optimizations for ARM, but they exist. Let me
> know if you need pointers.
>
> In any case, if you have to go beyond loop transformations, unrolling,
> register blocking, I would advise to look into data layout transformations,
> compacting strided blocks, which is one of the key optimizations in [2], or
> transposition. These primarily help with TLB misses, enabling the full power
> of 3D tiling (they have less impact on 2D tiling). And by the way, they will
> in turn improve the effectiveness of hardware prefetching.

Thank you very much for the advice! I could probably avoid using
software prefetching in the project, if Tobias doesn’t disagree. My
understanding is that prefetching isn’t used explicitly in [1] and,
according to [2], in some cases 90% of the turbo boost peak of the
processor can be attained without it.

I started to consider prefetching because it’s used in the
implementations of the gemm micro-kernels of the BLIS framework [3]. If
I’m not mistaken, it’s applied to try to make sure that micro-panel Br
is loaded after micro-panel Ar (as required in [1], p. 11). For
example, using it helps to reduce the execution time of the attached
implementation.
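
To make the idea concrete, here is a minimal sketch of a micro-kernel
that issues software prefetch hints for Ar before Br. It is not the
attached implementation and not the BLIS kernel from [3]: the names
micro_kernel, MR, NR and PF_DIST, the 4x4 block size, the prefetch
distance and the packed layout are all assumptions made for
illustration. __builtin_prefetch is the GCC/Clang builtin; it only
passes a hint to the hardware, so the order of the two calls does not
guarantee the order in which the cache lines actually arrive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, untuned sizes: block shape and prefetch distance. */
enum { MR = 4, NR = 4, PF_DIST = 8 };

/* Computes C += Ar * Br, where C is MR x NR with row stride ldc.
 * Assumed packed layout: the MR values of Ar used at step k are
 * contiguous (Ar[k * MR + i]), and likewise the NR values of Br
 * (Br[k * NR + j]). */
void micro_kernel(size_t kc, const double *Ar, const double *Br,
                  double *C, size_t ldc)
{
    double acc[MR][NR] = {{0.0}};

    assert(ldc >= NR);

    for (size_t k = 0; k < kc; ++k) {
        /* Stay inside the panels: only hint at data that exists. */
        if (k + PF_DIST < kc) {
            /* Hint for Ar first, then Br (0 = read, 3 = keep in cache). */
            __builtin_prefetch(&Ar[(k + PF_DIST) * MR], 0, 3);
            __builtin_prefetch(&Br[(k + PF_DIST) * NR], 0, 3);
        }
        /* Rank-1 update of the accumulator block. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += Ar[k * MR + i] * Br[k * NR + j];
    }

    /* Write the accumulated block back to C. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}
```

Whether the hints help at all depends on the target: on recent x86
they may make no difference, so the distance would have to be measured
rather than assumed.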

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[3] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c

-- 
                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD.c
Type: text/x-csrc
Size: 5697 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160520/16edb8dc/attachment.c>

