[libcxx-dev] OpenMP parallel reduce bugs

Dvorskiy, Mikhail via libcxx-dev libcxx-dev at lists.llvm.org
Thu Oct 1 06:11:54 PDT 2020

Hi Christopher,

Yes,  “is_partitioned” algo implementation is based on a reduction parallel pattern.
And it looks that a binary operation (combiner)  is not commutative.

In general, “reduction” algorithm requires a commutative binary operation. And OpenMP reduction requires that.
For TBB backend it works because TBB parallel reduction algorithm doesn’t require a commutative binary operation.

We (me or Evgeniy) will check that hypo and inform you.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Thursday, October 1, 2020 2:46 AM
To: Kukanov, Alexey <Alexey.Kukanov at intel.com>
Cc: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: [libcxx-dev] OpenMP parallel reduce bugs

Hello friends,

I have been working on the OpenMP backend for the parallel STL, and most of the tests are passing. However, among the failures is the "is_partitioned" test. I have rewritten the __parallel_reduce backend function to be simpler to understand in an attempt to understand what is failing (code is below.)

I also rewrote it as a serial function that splits the iteration range in two and then calls __reduction() on each half of the range being passed in. The result I get from the serial execution as compared to the result I get from the parallel execution is different.

I have verified that the parallel execution tasks are run, and that their results match what each serial execution would be if I ran them that way.

I am wondering if there is something wrong with the way OpenMP is running the reducer here? Perhaps it is injecting a value into the computation that is unexpected for this algorithm? Does anything jump out at anyone as being suspicious?

Thank you again for your time and assistance!

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
    std::size_t __item_count = __last - __first;
    std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;

    // We should encapsulate a result value and a reduction operator since we
    // cannot use a lambda in OpenMP UDR.
    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
        auto __begin = __first + __i;
        auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
        __result.__value = __real_body(__begin, __end, __identity);

    return __result.__value;
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/libcxx-dev/attachments/20201001/8cbc0dfe/attachment-0001.html>

More information about the libcxx-dev mailing list