[libcxx-dev] OpenMP parallel reduce bugs

Dvorskiy, Mikhail via libcxx-dev libcxx-dev at lists.llvm.org
Mon Oct 5 01:22:16 PDT 2020


Hi Christopher,
I’ve double-checked the code of __pattern_is_partitioned (which is based on the parallel reduction pattern). Yes, its binary operation is not commutative, so my hypothesis was right.
Generally speaking, writing the reduction pattern “manually”, without OpenMP’s reducer, is not a good approach, because it may not be efficient. Indeed, in your example the second (for) loop combines the results serially, and the std::vector adds extra overhead…

Once more: an OpenMP reduction requires a commutative binary operation, and that is by design. From the PSTL design perspective, an algorithm pattern should not rely on the parallel reduction pattern (which is provided by a parallel backend) supporting a non-commutative binary operation. So this is an issue in __pattern_is_partitioned and we will fix it; in the meantime I would suggest not worrying about it.
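
To make the non-commutativity concrete, here is a simplified, hypothetical sketch of an is_partitioned-style chunk combiner (the state encoding and the names are mine, not the actual __pattern_is_partitioned code): each chunk is classified by how the predicate behaves on it, and the combine step is only valid when the left operand really comes from the chunk that precedes the right one.

enum class __chunk_state { __all_true, __all_false, __true_false, __broken };

inline __chunk_state
__combine_states(__chunk_state __left, __chunk_state __right)
{
    if (__left == __chunk_state::__broken || __right == __chunk_state::__broken)
        return __chunk_state::__broken;
    const bool __left_ends_false =
        __left == __chunk_state::__all_false || __left == __chunk_state::__true_false;
    const bool __right_has_true =
        __right == __chunk_state::__all_true || __right == __chunk_state::__true_false;
    // Once the left part has seen a false, any true on the right breaks the
    // partition, so the operands cannot be swapped.
    if (__left_ends_false && __right_has_true)
        return __chunk_state::__broken;
    if (__left == __chunk_state::__all_true && __right == __chunk_state::__all_true)
        return __chunk_state::__all_true;
    if (__left == __chunk_state::__all_false)
        return __chunk_state::__all_false; // __right must be __all_false here
    return __chunk_state::__true_false;
}

// The asymmetry that an order-agnostic reduction cannot tolerate:
//   __combine_states(__all_true, __all_false) == __true_false  (still partitioned)
//   __combine_states(__all_false, __all_true) == __broken      (not partitioned)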

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Saturday, October 3, 2020 6:19 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Hello again,

I was able to rewrite the parallel_reduce function in a way that works without using OpenMP's reducer. I have a couple of questions:

1. I use a vector to gather the intermediate results for later reduction. Is there any problem with depending on std::vector here?
2. I can see that it might make sense to build a taskloop for the actual reduction if the number of chunks is quite large. Is that something that I should look into more? (One possible approach is sketched after the code below.)

The code is below. Please let me know if you have any questions or concerns.


//------------------------------------------------------------------------
// parallel_reduce
//------------------------------------------------------------------------

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    std::vector<_Value> __values(__n_chunks);

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop shared(__values))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
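        // Chunk 0 may have a different (first) size; every later chunk starts
        // where the previous one ended and is __chunk_size elements long.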
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __values[__chunk] = __real_body(__begin, __end, __identity);
    }

    auto __result = __values.front();
    for (auto p = __values.begin() + 1; p != __values.end(); ++p)
    {
        __result = __reduction(__result, *p);
    }

    return __result;
}
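
Regarding question 2 above: one possibility (a sketch under my own assumptions, reusing the conventions of the code above rather than anything that already exists in the backend) is to replace the serial combine loop with an order-preserving pairwise pass, so the reduction still only ever sees adjacent partial results with the left operand coming first:

template <typename _Value, typename _Reduction>
_Value
__combine_chunk_results(std::vector<_Value>& __values, _Reduction __reduction)
{
    // Each round merges __values[__i] with __values[__i + __stride]; the
    // operands are adjacent surviving partial results, so the original
    // left-to-right order is preserved and no commutativity is required.
    const std::size_t __n = __values.size();
    for (std::size_t __stride = 1; __stride < __n; __stride *= 2)
    {
        _PSTL_PRAGMA(omp taskloop shared(__values))
        for (std::size_t __i = 0; __i < __n - __stride; __i += 2 * __stride)
            __values[__i] = __reduction(__values[__i], __values[__i + __stride]);
    }
    return __values.front();
}

This does O(log n) rounds instead of one serial pass; whether it is worth the extra tasks presumably depends on how many chunks there are and how expensive __reduction is.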

On Fri, Oct 2, 2020 at 1:33 PM Christopher Nelson <nadiasvertex at gmail.com<mailto:nadiasvertex at gmail.com>> wrote:
Thank you. I wondered if you had an update on this. I've done some further investigation, and I think that is correct. I've tried to find example implementations of reductions with OpenMP that don't require a commutative operator. It seems like rewriting the is_partitioned algorithm to provide a commutative operator might be a larger / undesirable change.

Do you have any guidance on manually writing a taskloop in OpenMP that performs the reduction without requiring commutativity?

Thanks!

-={C}=-

On Thu, Oct 1, 2020 at 9:11 AM Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com<mailto:mikhail.dvorskiy at intel.com>> wrote:
Hi Christopher,

Yes, the “is_partitioned” algorithm implementation is based on a parallel reduction pattern,
and it looks like its binary operation (the combiner) is not commutative.

In general, a “reduction” algorithm requires a commutative binary operation, and OpenMP reduction does require that.
For the TBB backend it works because the TBB parallel reduction algorithm does not require a commutative binary operation.

We (Evgeniy or I) will check that hypothesis and let you know.
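
Not from the thread, but for contrast: the functional form of tbb::parallel_reduce always passes the result of an earlier subrange as the left argument of the join functor, so an associative but non-commutative operation (string concatenation below) still produces the sequential answer. The function name and setup here are purely illustrative:

#include <cstddef>
#include <string>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// Concatenation is associative but not commutative; TBB's ordered joins keep
// the result equal to the sequential left-to-right concatenation.
std::string concatenate(const std::vector<std::string>& words)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, words.size()), std::string{},
        [&](const tbb::blocked_range<std::size_t>& r, std::string running) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                running += words[i];
            return running;
        },
        // Join of two adjacent partial results: the left operand comes from
        // the earlier subrange, the right from the later one.
        [](const std::string& left, const std::string& right) { return left + right; });
}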

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com<mailto:nadiasvertex at gmail.com>>
Sent: Thursday, October 1, 2020 2:46 AM
To: Kukanov, Alexey <Alexey.Kukanov at intel.com<mailto:Alexey.Kukanov at intel.com>>
Cc: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com<mailto:mikhail.dvorskiy at intel.com>>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com<mailto:evgeniy.pavlov at intel.com>>; Louis Dionne <ldionne at apple.com<mailto:ldionne at apple.com>>; Thomas Rodgers <trodgers at redhat.com<mailto:trodgers at redhat.com>>; Libc++ Dev <libcxx-dev at lists.llvm.org<mailto:libcxx-dev at lists.llvm.org>>
Subject: [libcxx-dev] OpenMP parallel reduce bugs

Hello friends,

I have been working on the OpenMP backend for the parallel STL, and most of the tests are passing. However, among the failures is the "is_partitioned" test. I have rewritten the __parallel_reduce backend function to be simpler, in an attempt to understand what is failing (the code is below).

I also rewrote it as a serial function that splits the iteration range in two and then calls __reduction() on each half of the range being passed in. The result I get from the serial execution differs from the result I get from the parallel execution.

I have verified that the parallel tasks do run, and that their individual results match what the corresponding serial executions produce.

Is there something wrong with the way OpenMP is running the reducer here? Perhaps it is injecting a value into the computation that this algorithm does not expect? Does anything jump out at anyone as suspicious?

Thank you again for your time and assistance!

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __item_count = __last - __first;
    std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;

    // We should encapsulate a result value and a reduction operator since we
    // cannot use a lambda in OpenMP UDR.
    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
    {
        auto __begin = __first + __i;
        auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
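        // Note: with the taskloop reduction pragma above commented out, this
        // loop runs serially and each pass overwrites __result.__value rather
        // than combining it with the previous partial result.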
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}
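
As a side note on the "injected value" question: below is a hedged, standalone sketch (my own example, not the expansion of _PSTL_PRAGMA_DECLARE_REDUCTION) of an OpenMP user-defined reduction used with a taskloop, as in the code above (this requires OpenMP 5.0). Two properties are relevant to the failure: each participating task works on a private copy created from the initializer clause, and the runtime may merge those private copies in an unspecified order, which only reproduces the sequential result when the combiner is both associative and commutative.

#include <cstddef>
#include <string>

// Hypothetical combiner type; concatenation is associative but not commutative.
struct Concat
{
    std::string value;
};

// Each task gets a private Concat created by the initializer clause; the
// runtime later merges the private copies, and which copy ends up as omp_out
// versus omp_in is not tied to the original chunk order.
#pragma omp declare reduction(concat : Concat : omp_out.value += omp_in.value) \
    initializer(omp_priv = Concat{})

std::string join_chunks(const std::string* chunks, std::size_t n)
{
    Concat result;
    #pragma omp taskloop reduction(concat : result)
    for (std::size_t i = 0; i < n; ++i)
        result.value += chunks[i];
    // With a commutative operation the merge order would be invisible; with
    // concatenation the final string can come out permuted by chunk.
    return result.value;
}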