[llvm-dev] Determination of statements that contain only matrix multiplication

Albert Cohen via llvm-dev llvm-dev at lists.llvm.org
Sat May 28 07:48:47 PDT 2016

Sorry for not responding earlier.

On 05/20/2016 03:05 PM, Roman Gareev wrote:
> Thank you very much for the advice! I could probably try to avoid
> using non-hardware prefetching in the project, if Tobias doesn't
> disagree with it. My understanding is that prefetching isn't used
> explicitly in [1] and, according to [2], in some cases 90% of the
> turbo boost peak of the processor can be attained without it.

Too many negations :-) I'm not sure I followed exactly what you wanted 
to say, but I understand that this is not the priority since you can get 
90% of the performance without worrying about prefetching.

> I started to consider prefetching because it's used in
> implementations of gemm micro-kernels in the BLIS framework [3]. If I'm
> not mistaken, it's applied to try to make sure that micro-panel Br is
> loaded after micro-panel Ar (as required in [1] p. 11). For example,
> using it helps reduce the execution time of the attached
> implementation.
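For readers less familiar with the kernels under discussion, here is a minimal portable-C sketch (not the BLIS code) of what an 8x4 double-precision gemm micro-kernel computes: an MR x NR tile of C updated by kc rank-1 products of a packed micro-panel Ar and a packed micro-panel Br. The 8x4 shape is borrowed from the name of the file cited as [3]; the packing layout, the name micro_kernel_8x4, and the placement of the two prefetch hints are assumptions made for illustration. __builtin_prefetch is a GCC/Clang builtin and is only a hint to the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register-block sizes, matching the 8x4 shape in the
 * name of the file cited as [3]. */
enum { MR = 8, NR = 4 };

/* Computes C += Ar * Br for one MR x NR tile as kc rank-1 updates.
 *   ar: packed micro-panel of A; column p is ar[p*MR .. p*MR + MR-1]
 *   br: packed micro-panel of B; row p is br[p*NR .. p*NR + NR-1]
 *   c : column-major tile with leading dimension ldc
 * The two prefetch hints only touch the first cache line of each
 * micro-panel; this sketch assumes the hardware prefetcher follows
 * the rest of each sequential stream. */
void micro_kernel_8x4(size_t kc, const double *ar, const double *br,
                      double *c, size_t ldc)
{
    assert(ldc >= MR);
    double acc[MR * NR] = {0};  /* accumulators (registers in a real kernel) */

    __builtin_prefetch(ar, 0, 3);  /* read access, high temporal locality */
    __builtin_prefetch(br, 0, 3);

    for (size_t p = 0; p < kc; ++p) {
        const double *a = ar + p * MR;  /* column p of Ar */
        const double *b = br + p * NR;  /* row p of Br */
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                acc[j * MR + i] += a[i] * b[j];
    }

    /* Write the accumulated tile back into C. */
    for (size_t j = 0; j < NR; ++j)
        for (size_t i = 0; i < MR; ++i)
            c[j * ldc + i] += acc[j * MR + i];
}
```

A real kernel keeps the accumulators in vector registers and unrolls the p loop; the scalar loops here only show the data flow the prefetch hints are meant to serve.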

Interesting. The BLIS implementation prefetches only the first cache
line before traversing a given interval of memory. This clearly
confirms that the implementation relies on hardware prefetching to fetch
the subsequent lines, which makes a lot of sense. Yet, surprisingly, the
BLIS implementation does not attempt to anticipate the fetch: it
schedules the prefetch instruction right before the first load of a
given interval.
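The difference between prefetching right before the first load and anticipating the fetch can be sketched as follows. This is a hypothetical helper (sum_intervals and its distance parameter are made up for illustration) that explicitly prefetches only the first cache line of each interval and leaves the rest to the hardware prefetcher, as described above for BLIS; it is a sketch under those assumptions, not BLIS code. __builtin_prefetch is again the GCC/Clang builtin.

```c
#include <stddef.h>

/* Hypothetical helper: sums n_intervals consecutive intervals of len
 * doubles each. Before the first load of interval k it issues one
 * prefetch for the first cache line of interval k + distance.
 *   distance == 0: prefetch immediately before the first load
 *                  (the placement described above for BLIS);
 *   distance >= 1: anticipate the fetch by `distance` intervals.
 * With distance >= 1 the first `distance` intervals get no explicit
 * prefetch. Only one line per interval is prefetched explicitly; the
 * remaining lines are left to the hardware prefetcher. */
double sum_intervals(const double *data, size_t n_intervals, size_t len,
                     size_t distance)
{
    double total = 0.0;
    for (size_t k = 0; k < n_intervals; ++k) {
        if (k + distance < n_intervals)
            __builtin_prefetch(data + (k + distance) * len, 0, 3);
        const double *interval = data + k * len;
        for (size_t i = 0; i < len; ++i)
            total += interval[i];
    }
    return total;
}
```

The result does not depend on distance, since prefetches are only hints; whether a distance of 1 or more actually helps depends on the interval length and on how far ahead the hardware prefetcher already runs, which would have to be measured.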

> Refs:
> [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
> [2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
> [3] - https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
