[PATCH] D125202: [Polly] Disable matmul pattern-match + -polly-parallel

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 23 08:41:42 PDT 2022


Meinersbur added a comment.

In D125202#3530895 <https://reviews.llvm.org/D125202#3530895>, @gareevroman wrote:

> In D125202#3517374 <https://reviews.llvm.org/D125202#3517374>, @Meinersbur wrote:
>
>> In D125202#3514940 <https://reviews.llvm.org/D125202#3514940>, @gareevroman wrote:
>>
>>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.
>>
>> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.
>
> Such a strategy is used in many hand-tuned libraries that implement matrix multiplication. In the general case, it helps to obtain good results.

Do you have measurements?
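
For reference, here is a minimal sketch of the tiled loop nest under discussion, so the $loop_4$ naming is concrete. The loop numbering, tile-size names, and placement of the packing steps are my illustrative assumptions of the BLIS-style structure, not Polly's exact output; $loop_4$ is the second loop around the micro-kernel. Packing is omitted, and `C` is assumed to be zero-initialized:

```c
enum { M = 1024, N = 1024, K = 1024 };  /* problem sizes (illustrative)  */
enum { Mc = 256, Nc = 512, Kc = 256,    /* cache tile sizes              */
       Mr = 4,   Nr = 8 };              /* register tile sizes           */

void matmul_tiled(double C[M][N], const double A[M][K],
                  const double B[K][N]) {
  for (int jc = 0; jc < N; jc += Nc)                /* loop_1 (outermost)     */
    for (int pc = 0; pc < K; pc += Kc)              /* loop_2: pack Packed_B  */
      for (int ic = 0; ic < M; ic += Mc)            /* loop_3: pack Packed_A  */
        for (int jr = jc; jr < jc + Nc; jr += Nr)   /* loop_4: 2nd loop around the micro-kernel */
          for (int ir = ic; ir < ic + Mc; ir += Mr) /* loop_5: 1st loop around the micro-kernel */
            /* micro-kernel: update the Mr x Nr block of C at (ir, jr) */
            for (int p = pc; p < pc + Kc; ++p)
              for (int i = ir; i < ir + Mr; ++i)
                for (int j = jr; j < jr + Nr; ++j)
                  C[i][j] += A[i][p] * B[p][j];
}
```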

>> Also, there will be false sharing between cache lines. It could be reduced by having the `#pragma omp parallel` outside the matrix multiplication, and only `#pragma omp for` on $loop_4$. However, Polly does not support that yet.
>
> Could you elaborate on why there will be false sharing between cache lines?

The innermost kernel:
`C[..][j_c] += ...`

When parallelizing the `j_c` loop,
`C[..][j_c+N_r - 1]` and `C[..][j_c+N_r]` might be in the same cache line, but processed by different threads. To avoid this, we'd need to ensure that `N_r` is at least as large as the cache line size of all cache levels, and also align the rows of `C` to that size.
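
A minimal sketch of the hazard (the sizes and names here are illustrative assumptions): with `N_r = 4` and 8-byte doubles, each `N_r`-wide panel of a row of C spans only 32 bytes, so two adjacent panels, written by two different threads, can land in one 64-byte cache line:

```c
enum { N = 1024, Nr = 4 };  /* Nr * sizeof(double) = 32 bytes < 64-byte line */

void update_row(double *restrict Crow /* one row of C */) {
  /* Parallelizing the panel loop: each thread owns a set of Nr-wide
   * column panels of C. Crow[3] (last column of one panel) and Crow[4]
   * (first column of the next) may sit in the same cache line yet be
   * written by different threads -- false sharing. */
  #pragma omp parallel for
  for (int jr = 0; jr < N; jr += Nr)
    for (int j = jr; j < jr + Nr; ++j)
      Crow[j] += 1.0;  /* stands in for the micro-kernel's update of C */
}
```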

>> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache but that would be optional. We'd divide the packed array size equally between threads.
>
> Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective usage of the cache hierarchy. So I would propose to eventually exploit the knowledge about the data-usage pattern. In my experience, parallelizing $loop_1$ produces worse performance than parallelizing $loop_4$ in many cases.

Do you have measurements?

Polly inserts `#pragma omp parallel for` at the parallelized loop, which introduces thread fork-and-join overhead each time that loop is entered. It would be cheaper to have just one `#pragma omp parallel` at the outermost level (to spawn the threads) and only a `#pragma omp for` on $loop_4$. Polly currently does not support this.
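
A sketch of that cheaper structure (a hypothetical skeleton, not something Polly emits today; `panel_update` stands in for packing plus the macro-kernel body):

```c
void matmul_hoisted(int M, int N, int Mc, int Nr,
                    void (*panel_update)(int ic, int jr)) {
  #pragma omp parallel                /* threads are forked once, here */
  for (int ic = 0; ic < M; ic += Mc) {
    /* packing of Packed_A for this ic block would go here, e.g.
     * guarded by "#pragma omp single" */
    #pragma omp for                   /* loop_4: worksharing only, no fork */
    for (int jr = 0; jr < N; jr += Nr)
      panel_update(ic, jr);
    /* the implicit barrier at the end of the "omp for" keeps the
     * threads in step before the next ic block */
  }
}
```

The threads are spawned once for the whole region; each `#pragma omp for` merely distributes iterations among them.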


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125202/new/

https://reviews.llvm.org/D125202


