[PATCH] D125202: [Polly] Disable matmul pattern-match + -polly-parallel

Mon May 23 02:48:15 PDT 2022

gareevroman added a comment.

In D125202#3517374 <https://reviews.llvm.org/D125202#3517374>, @Meinersbur wrote:

> In D125202#3514940 <https://reviews.llvm.org/D125202#3514940>, @gareevroman wrote:
>
>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.
>
> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.

This strategy is used in many hand-tuned libraries that implement matrix multiplication. In the general case, it helps to obtain good results.

> Also, there will be false sharing between cache lines. It could be reduced by having the `#pragma omp parallel` outside the matrix multiplication, and only `#pragma omp for` on $loop_4$. However, Polly does not support that yet.

Could you elaborate on why there will be false sharing between cache lines?

> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache but that would be optional. We'd divide the packed array size equally between threads.

Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective use of the cache hierarchy. So I would propose eventually exploiting the knowledge about the data usage pattern. In my experience, parallelizing $loop_1$ produces worse performance than parallelizing $loop_4$ in many cases.
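
For illustration, here is a minimal C sketch of the loop structure being discussed, with the second loop around the micro-kernel parallelized. It is not the code Polly generates. The block sizes, the function name `gemm_goto_sketch`, the array name `Packed_B`, and the simplified packing layout are assumptions made for the example. The loops are labelled BLIS-style ("Nth loop around the micro-kernel"), because this message does not say how $loop_1$ and $loop_4$ map onto them.

```c
#include <stddef.h>

/* Hypothetical block sizes chosen for readability, not tuned for any cache. */
enum { NC = 64, KC = 32, MC = 16, NR = 4, MR = 4 };

/* Computes C += A * B for row-major A (m x k), B (k x n), C (m x n).
   Assumes m, n and k are multiples of MC, NC and KC respectively.
   The static buffers make this sketch non-reentrant. */
static void gemm_goto_sketch(size_t m, size_t n, size_t k,
                             const double *A, const double *B, double *C)
{
    static double Packed_B[KC * NC]; /* panel of B, reused across the 3rd loop */
    static double Packed_A[MC * KC]; /* block of A, intended to stay in L2 */

    for (size_t jc = 0; jc < n; jc += NC) {         /* 5th loop (outermost) */
        for (size_t pc = 0; pc < k; pc += KC) {     /* 4th loop */
            /* Pack the KC x NC panel of B (plain row-major copy here). */
            for (size_t p = 0; p < KC; ++p)
                for (size_t j = 0; j < NC; ++j)
                    Packed_B[p * NC + j] = B[(pc + p) * n + jc + j];

            for (size_t ic = 0; ic < m; ic += MC) { /* 3rd loop */
                /* Pack the MC x KC block of A (plain row-major copy here). */
                for (size_t i = 0; i < MC; ++i)
                    for (size_t p = 0; p < KC; ++p)
                        Packed_A[i * KC + p] = A[(ic + i) * k + pc + p];

                /* 2nd loop around the micro-kernel. Iterations update
                   disjoint columns of C, so they carry no dependences.
                   The pragma is ignored when OpenMP is not enabled. */
                #pragma omp parallel for
                for (size_t jr = 0; jr < NC; jr += NR) {
                    for (size_t ir = 0; ir < MC; ir += MR) { /* 1st loop */
                        /* Micro-kernel: MR x NR tile of C. */
                        for (size_t i = 0; i < MR; ++i)
                            for (size_t j = 0; j < NR; ++j) {
                                double acc = 0.0;
                                for (size_t p = 0; p < KC; ++p)
                                    acc += Packed_A[(ir + i) * KC + p]
                                         * Packed_B[p * NC + jr + j];
                                C[(ic + ir + i) * n + jc + jr + j] += acc;
                            }
                    }
                }
            }
        }
    }
}
```

In this arrangement all threads read the same Packed_A block, while each thread works on its own slice of the packed panel of B. Parallelizing the outermost loop instead would require a separate packed buffer per thread.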

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125202/new/

https://reviews.llvm.org/D125202