[llvm-dev] Determination of statements that contain only matrix multiplication
Roman Gareev via llvm-dev
llvm-dev at lists.llvm.org
Fri May 20 06:05:22 PDT 2016
2016-05-19 21:45 GMT+05:00 4lbert C0hen <4lbert.h.c0hen at gmail.com>:
> One short note. I would advise against spending time on prefetching for x86.
> Recent hardware prefetchers are amazingly good at strided accesses in
> single-threaded code. Caution: this is not based on objective/published
> data, but on personal experience.
> There are open challenges in multiprocessor prefetching, even for regularly
> strided data, but these are probably too ambitious to be tackled
> effectively in the time frame of a SoC. There are lots of papers on this.
> Now, if you are targeting lower power processors, including most ARM v7a/v8
> implementations, prefetching may be much more important. There are much
> fewer publications on BLAS optimizations for ARM, but they exist. Let me
> know if you need pointers.
> In any case, if you have to go beyond loop transformations, unrolling, and
> register blocking, I would advise looking into data layout transformations:
> compacting strided blocks, which is one of the key optimizations in , or
> transposition. These primarily help with TLB misses, enabling the full power
> of 3D tiling (they have less impact on 2D tiling). And by the way, they will
> in turn improve the effectiveness of hardware prefetching.
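
To make the layout suggestion above concrete, here is a minimal sketch of compacting a strided block into a contiguous buffer. It is only an illustration under stated assumptions: the function name pack_a_block, the row-major layout, and the parameters mc, kc and mr are hypothetical and are not taken from BLIS, Polly or the attached code.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Copies an mc x kc block of a row-major matrix `a` (leading dimension lda)
 * into `packed`, which is organised as micro-panels of mr rows each.
 * Inside a micro-panel the elements are stored column by column, so a
 * micro-kernel can walk the buffer with unit stride.
 * Rows past mc in the last micro-panel are padded with zeros.
 * `packed` must hold ((mc + mr - 1) / mr) * mr * kc doubles. */
static void pack_a_block(const double *a, size_t lda,
                         size_t mc, size_t kc, size_t mr,
                         double *packed)
{
    size_t out = 0;
    for (size_t i0 = 0; i0 < mc; i0 += mr) {        /* one micro-panel */
        for (size_t p = 0; p < kc; ++p) {           /* one column of it */
            for (size_t i = 0; i < mr; ++i) {
                size_t row = i0 + i;
                packed[out++] = (row < mc) ? a[row * lda + p] : 0.0;
            }
        }
    }
}
```

After packing, each micro-panel occupies consecutive addresses, so it touches fewer pages than the original strided rows and presents a simple unit-stride stream.
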
Thank you very much for the advice! I could probably avoid using
software prefetching in the project, if Tobias doesn’t disagree.
My understanding is that prefetching isn’t used
explicitly in  and, according to , in some cases 90% of the
turbo boost peak of the processor can be attained without it.
I started to consider prefetching because it’s used in
implementations of the gemm micro-kernels of the BLIS framework . If I’m
not mistaken, it’s applied to try to make sure that micro-panel Br is
loaded after micro-panel Ar (as required in  p. 11). For example,
using it helps to reduce the execution time of the attached
 - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
 - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
 - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
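
As a rough illustration of the software prefetching discussed above, here is a minimal sketch of a scalar micro-kernel that issues prefetch hints for the packed micro-panels. It is not the BLIS kernel and not the attached code: the name micro_kernel, the sizes MR and NR, and the distance PREFETCH_DIST are assumptions chosen for the example, and __builtin_prefetch is the GCC/Clang builtin rather than a hand-written instruction.

```c
#include <stddef.h>

/* Assumed sizes, for illustration only. */
enum { MR = 4, NR = 4, PREFETCH_DIST = 8 };

/* Hypothetical micro-kernel: C (MR x NR, row-major, leading dimension ldc)
 * += Ar * Br, where Ar is a packed MR x kc micro-panel stored column by
 * column and Br is a packed kc x NR micro-panel stored row by row. */
static void micro_kernel(size_t kc, const double *ar, const double *br,
                         double *c, size_t ldc)
{
    double acc[MR][NR] = {{0.0}};

    for (size_t p = 0; p < kc; ++p) {
        /* Clamp the prefetch index so the address stays inside the panels. */
        size_t q = (p + PREFETCH_DIST < kc) ? p + PREFETCH_DIST : kc - 1;
        /* Read hints (rw = 0) with high temporal locality (3). */
        __builtin_prefetch(ar + q * MR, 0, 3);
        __builtin_prefetch(br + q * NR, 0, 3);

        /* Rank-1 update with column p of Ar and row p of Br. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += ar[p * MR + i] * br[p * NR + j];
    }

    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

The hints do not change the result, so the same kernel can be timed with and without them to check whether they help on a given target.
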
Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5697 bytes
Desc: not available