[PATCH] D125202: [Polly] Disable matmul pattern-match + -polly-parallel

Roman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun May 15 23:16:30 PDT 2022


gareevroman added a comment.

> I added this in aa8a976174c7ac08676bbc7bb647f6bc0efd2e72 and I think it does not actually make anything parallel, but I am not sure it is actually allowed due to `Packed_A` shared between all the threads.

As far as I know, parallelizing the outermost loop may be a good idea only on multisocket systems, where each CPU has a separate L3 cache (please see [1]). Additionally, it would require replicating Packed_A and Packed_B to avoid race conditions when using OpenMP parallelism.

I would suggest parallelizing the second loop around the micro-kernel by default. It does not violate the dependences and, in general, provides a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of matrix A into the created array Packed_A, which resides in the L2 cache. A minimal sketch of this strategy follows.
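
For illustration, here is a self-contained sketch of a BLIS-like loop nest with OpenMP parallelism applied to the second loop around the micro-kernel. It is not the code Polly generates: the tile sizes MC, NC, KC, MR, NR, the naive `micro_kernel`, and the assumption that all dimensions are multiples of the tile sizes are mine, chosen to keep the example short (compile with -fopenmp):

```
enum { MC = 64, NC = 128, KC = 64, MR = 4, NR = 8 }; // illustrative tile sizes

// Naive micro-kernel: updates an MR x NR block of C (leading dimension
// ldc) from slices of the packed buffers.
static void micro_kernel(int ldc, const double *Ap, const double *Bp,
                         double *Cb) {
  for (int k = 0; k < KC; ++k)
    for (int i = 0; i < MR; ++i)
      for (int j = 0; j < NR; ++j)
        Cb[i * ldc + j] += Ap[i * KC + k] * Bp[k * NC + j];
}

// C (M x N) += A (M x K) * B (K x N), all row-major; M, N, K are assumed
// to be multiples of MC, NC, KC respectively.
void gemm(int M, int N, int K, const double *A, const double *B, double *C) {
  double Packed_A[MC * KC], Packed_B[KC * NC];
  for (int jc = 0; jc < N; jc += NC)           // loop 5
    for (int pc = 0; pc < K; pc += KC) {       // loop 4
      for (int k = 0; k < KC; ++k)             // pack a KC x NC block of B
        for (int j = 0; j < NC; ++j)
          Packed_B[k * NC + j] = B[(pc + k) * N + jc + j];
      for (int ic = 0; ic < M; ic += MC) {     // loop 3
        for (int i = 0; i < MC; ++i)           // pack an MC x KC block of A
          for (int k = 0; k < KC; ++k)
            Packed_A[i * KC + k] = A[(ic + i) * K + pc + k];
        // Loop 2 around the micro-kernel: its iterations only read
        // Packed_A and Packed_B and write disjoint MR x NR blocks of C,
        // so it can run in parallel without replicating the packed buffers.
        #pragma omp parallel for
        for (int jr = 0; jr < NC; jr += NR)    // loop 2
          for (int ir = 0; ir < MC; ir += MR)  // loop 1
            micro_kernel(N, &Packed_A[ir * KC], &Packed_B[jr],
                         &C[(ic + ir) * N + jc + jr]);
      }
    }
}
```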

If the L2 cache is not shared and the second loop is parallelized, then the elements of the created array Packed_A are duplicated across the L2 caches. In this case, parallelizing the first loop around the macro-kernel can be considered instead. If we parallelize this loop, then each thread is assigned different elements of the matrix A, which reside in the L2 cache, and packs them into the created array Packed_A; since the array is shared, this may cause a race condition. So, the elements of Packed_A would have to be replicated, as the variant below shows.
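
Reusing the definitions from the sketch above, a hypothetical variant that replaces the loop over ic (loop 3 and everything inside it) would privatize the packed buffer, e.g.:

```
// Variant: parallelize the first loop around the macro-kernel instead.
// Each thread packs its own copy of an MC x KC block of A, i.e. the
// shared Packed_A is replaced by a per-thread buffer to avoid the race.
#pragma omp parallel
{
  double My_Packed_A[MC * KC]; // replicated per thread
  #pragma omp for
  for (int ic = 0; ic < M; ic += MC) {
    for (int i = 0; i < MC; ++i)  // pack this thread's block of A
      for (int k = 0; k < KC; ++k)
        My_Packed_A[i * KC + k] = A[(ic + i) * K + pc + k];
    for (int jr = 0; jr < NC; jr += NR)
      for (int ir = 0; ir < MC; ir += MR)
        micro_kernel(N, &My_Packed_A[ir * KC], &Packed_B[jr],
                     &C[(ic + ir) * N + jc + jr]);
  }
}
```

Threads write disjoint row blocks of C, and each holds its own packed copy of a block of A, which is exactly the duplication across the L2 caches described above.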

Probably, the best option is to add an additional flag that specifies whether the L2 cache is shared. Depending on it, we can choose between these parallelization strategies. I think such a flag should be set to true by default; a possible shape of such an option is sketched below.
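
For illustration only, such a flag could be declared in the style of Polly's existing options; the name `-polly-pattern-matching-shared-l2` and its wording are my assumptions, not an existing interface:

```
#include "polly/Options.h" // declares the PollyCategory option category
using namespace llvm;

// Hypothetical flag; neither the name nor the default is part of Polly.
static cl::opt<bool> SharedL2Cache(
    "polly-pattern-matching-shared-l2",
    cl::desc("Assume cores share the L2 cache when choosing which loop "
             "of the matched matrix multiplication to parallelize"),
    cl::init(true), cl::cat(PollyCategory));
```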

Refs.:

[1] - Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS’14). IEEE Computer Society, Washington, DC, 1049–1059. DOI: https://doi.org/10.1109/IPDPS.2014.110

[2] - Roman Gareev, Tobias Grosser, and Michael Kruse. 2018. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach. ACM Trans. Archit. Code Optim. 15, 3, Article 34 (August 2018), 27 pages. DOI: https://doi.org/10.1145/3235029


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125202/new/

https://reviews.llvm.org/D125202


