[libcxx-dev] OpenMP parallel reduce bugs

Dvorskiy, Mikhail via libcxx-dev libcxx-dev at lists.llvm.org
Wed Oct 21 06:36:16 PDT 2020


Hi Christopher,

Please find attached a patch which fixes the mentioned problem with the is_partitioned algorithm. I have not yet landed it in the LLVM repo, but you can try it now with the OpenMP backend.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Monday, October 5, 2020 3:01 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Great, thanks!

I appreciate it!

On Mon, Oct 5, 2020, 04:22 Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com> wrote:
Hi Christopher,
I’ve double-checked the code of __pattern_is_partitioned (which is based on the reduction parallel pattern). Yes, the binary operation is not commutative, so my hypothesis was right.
Generally speaking, writing the reduction pattern “manually”, without OpenMP’s reducer, is not a good approach because it may not be efficient. Indeed, in your example the second (for) loop combines the results serially, and std::vector brings additional overhead…

Once more: OpenMP reduction requires a commutative binary operation, and that is by design. From the PSTL design perspective, an algorithm pattern should not rely on the parallel reduction pattern (which is provided by a parallel backend) supporting a non-commutative binary operation. So it is an issue in __pattern_is_partitioned, and we will fix it; in the meantime I would suggest not worrying about it.
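
To see why the is_partitioned combiner is order-sensitive, here is a minimal, self-contained sketch (the state names and __combine function are made up for illustration; this is not the actual __pattern_is_partitioned code). Each chunk is summarized by the pattern of values it contains, and combining two summaries depends on which chunk is on the left:

#include <cassert>

// Hypothetical summary of one chunk for is_partitioned.
enum class _PartState { __all_true, __all_false, __true_then_false, __broken };

// Combine the summaries of a left chunk and the chunk to its right.
_PartState __combine(_PartState __l, _PartState __r)
{
    if (__l == _PartState::__broken || __r == _PartState::__broken)
        return _PartState::__broken;
    if (__l == _PartState::__all_true)
        // true... followed by false... becomes "true then false".
        return __r == _PartState::__all_false ? _PartState::__true_then_false : __r;
    // Once false values have appeared, only more false values are allowed.
    return __r == _PartState::__all_false ? __l : _PartState::__broken;
}

int main()
{
    // Associative, but not commutative: swapping the operands changes the answer.
    assert(__combine(_PartState::__all_true, _PartState::__all_false) ==
           _PartState::__true_then_false);
    assert(__combine(_PartState::__all_false, _PartState::__all_true) ==
           _PartState::__broken);
}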

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Saturday, October 3, 2020 6:19 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Hello again,

I was able to rewrite the parallel_reduce function in a way that works without using OpenMP's reducer. I have a couple of questions:

1. I use a std::vector to gather the intermediate results for later reduction. Is there any problem with depending on vector here?
2. I can see that it might make sense to build a taskloop for the actual reduction step if the number of chunks is quite large. Is that something that I should look into more? (See the sketch after the code below.)

The code is below. Please let me know if you have any questions or concerns.


//------------------------------------------------------------------------
// parallel_reduce
//------------------------------------------------------------------------

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    std::vector<_Value> __values(__n_chunks);

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop shared(__values))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
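        // The first chunk may have a different size from the rest; compute
        // this chunk's extent and its offset within [__first, __last).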
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __values[__chunk] = __real_body(__begin, __end, __identity);
    }

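    // Combine the per-chunk results serially, preserving left-to-right
    // order so that a non-commutative __reduction remains correct.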
    auto __result = __values.front();
    for (auto __p = __values.begin() + 1; __p != __values.end(); ++__p)
    {
        __result = __reduction(__result, *__p);
    }

    return __result;
}
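
On question 2 above: if __n_chunks grew very large, the serial combine could be replaced by a pairwise (tree) combine. Here is a hedged sketch of just that final loop, under the assumption that __reduction is associative; adjacent order is preserved, so commutativity is still not required:

// Hypothetical replacement for the serial combine above: fold
// neighbouring partial results so the combine depth is O(log n)
// rather than O(n). Each round could itself become an "omp taskloop"
// when __values.size() is large; it is shown serially for clarity.
for (std::size_t __stride = 1; __stride < __values.size(); __stride *= 2)
{
    for (std::size_t __i = 0; __i + __stride < __values.size(); __i += 2 * __stride)
    {
        __values[__i] = __reduction(__values[__i], __values[__i + __stride]);
    }
}
return __values.front();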

On Fri, Oct 2, 2020 at 1:33 PM Christopher Nelson <nadiasvertex at gmail.com> wrote:
Thank you. I wondered if you had an update on this. I've done some further looking, and I think that is correct. I've tried to find example implementations of OpenMP reductions that don't require a commutative operator. It seems like rewriting the is_partitioned algorithm to provide a commutative operator might be a larger / undesirable change.

Do you have any guidance on manually writing a task loop in OpenMP that performs the reduction without requiring commutativity?

Thanks!

-={C}=-

On Thu, Oct 1, 2020 at 9:11 AM Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com> wrote:
Hi Christopher,

Yes, the “is_partitioned” algo implementation is based on a reduction parallel pattern.
And it looks like the binary operation (the combiner) is not commutative.

In general, a parallel “reduction” algorithm requires at least an associative binary operation; OpenMP reduction additionally requires it to be commutative.
For the TBB backend it works because the TBB parallel reduction algorithm doesn’t require a commutative binary operation.
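
For comparison, here is a minimal sketch of TBB’s functional-form tbb::parallel_reduce (standard TBB API usage, not PSTL backend code). Partial results are joined in left-to-right range order, so only associativity is needed:

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

#include <cstddef>
#include <functional>
#include <vector>

int __sum(const std::vector<int>& __v)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, __v.size()),
        0, // identity
        // Body: reduce one subrange, starting from a partial value.
        [&](const tbb::blocked_range<std::size_t>& __r, int __init) {
            for (std::size_t __i = __r.begin(); __i != __r.end(); ++__i)
                __init += __v[__i];
            return __init;
        },
        // Join: combines adjacent partial results in range order, so a
        // non-commutative (but associative) operation would also work.
        std::plus<int>());
}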

We (Evgeniy or I) will check that hypothesis and let you know.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Thursday, October 1, 2020 2:46 AM
To: Kukanov, Alexey <Alexey.Kukanov at intel.com>
Cc: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: [libcxx-dev] OpenMP parallel reduce bugs

Hello friends,

I have been working on the OpenMP backend for the parallel STL, and most of the tests are passing. However, among the failures is the "is_partitioned" test. I have rewritten the __parallel_reduce backend function in a simpler form to try to understand what is failing (the code is below).

I also rewrote it as a serial function that splits the iteration range in two and then calls __reduction() on each half of the range being passed in. The result I get from the serial execution differs from the result I get from the parallel execution.

I have verified that the parallel execution tasks are run, and that their results match what each serial execution would be if I ran them that way.

I am wondering if there is something wrong with the way OpenMP is running the reducer here. Perhaps it is injecting a value into the computation that this algorithm does not expect? Does anything jump out at anyone as suspicious?

Thank you again for your time and assistance!

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __item_count = __last - __first;
    std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;

    // We should encapsulate a result value and a reduction operator since we
    // cannot use a lambda in OpenMP UDR.
    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
    {
        auto __begin = __first + __i;
        auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}
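
For reference on the mechanism being used here: _PSTL_PRAGMA_DECLARE_REDUCTION declares an OpenMP user-defined reduction (UDR). A hedged sketch of roughly what such a declaration looks like in plain OpenMP (the exact macro expansion in the PSTL sources may differ):

// Sketch of an OpenMP user-defined reduction for a combiner object:
// omp_out/omp_in are the two partial results being merged, and each
// task-private copy (omp_priv) starts from the original value, which
// holds the identity.
#pragma omp declare reduction(__combiner : _CombinerType : \
        omp_out(omp_in)) initializer(omp_priv = omp_orig)

With the taskloop reduction clause enabled, each task assigns its chunk's result into a private _CombinerType, and OpenMP then merges the private copies with this combiner. OpenMP does not specify the order in which those merges happen, which is where the commutativity requirement bites.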
-------------- next part --------------
A non-text attachment was scrubbed...
Name: is_partitioned_commutative_bi_op.patch
Type: application/octet-stream
Size: 207707 bytes
Desc: is_partitioned_commutative_bi_op.patch
URL: <http://lists.llvm.org/pipermail/libcxx-dev/attachments/20201021/a5ea6cec/attachment-0001.obj>

