[libcxx-dev] OpenMP parallel reduce bugs

Dvorskiy, Mikhail via libcxx-dev libcxx-dev at lists.llvm.org
Mon Nov 2 00:34:47 PST 2020


Hi Christopher,

I will try to investigate what is wrong…
Could you please send me your merged version of algorithm_impl.h? I’ll check it as well.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Saturday, October 31, 2020 5:33 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Hi Mikhail,

I used the patch, but the unit tests are failing. Maybe I applied the patch badly, or perhaps I did something wrong in the reducer. I rewrote the parallel reduce function as:


template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __omp_backend::__chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}
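
For reference, the index arithmetic in the loop above assumes the chunk sizes behave roughly as in the following simplified illustration (inferred only from how the values are used here; this is not the actual __omp_backend::__chunk_partitioner): the range is split into __n_chunks pieces of __chunk_size elements each, with the first chunk absorbing any remainder.

template <typename _Iterator>
void
__chunk_partitioner_sketch(_Iterator __first, _Iterator __last, std::size_t& __n_chunks,
                           std::size_t& __chunk_size, std::size_t& __first_chunk_size)
{
    const std::size_t __n = static_cast<std::size_t>(__last - __first);
    const std::size_t __grain = 512; // placeholder for __default_chunk_size

    if (__n < __grain)
    {
        // Small ranges become a single chunk.
        __n_chunks = 1;
        __chunk_size = __n;
        __first_chunk_size = __n;
        return;
    }

    __n_chunks = __n / __grain;
    __chunk_size = __grain;
    // The first chunk picks up the leftover elements, so that
    // __first_chunk_size + (__n_chunks - 1) * __chunk_size == __n.
    __first_chunk_size = __chunk_size + __n % __grain;
}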

On Sat, Oct 31, 2020 at 9:35 AM Christopher Nelson <nadiasvertex at gmail.com> wrote:
I was able to puzzle out the patch, and it seems to compile now. I'll rewrite the reducer and see how it goes. Thanks!

On Fri, Oct 30, 2020 at 4:29 PM Christopher Nelson <nadiasvertex at gmail.com> wrote:
Hi Mikhail,

I tried to apply the patch today, but it doesn't apply cleanly. May I ask what version of the repo I should use? After I apply it (skipping a hunk that doesn't apply) I get some compile errors:


/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1780:13: error: unknown type name '_ReduceRes'
            _ReduceRes __table[] = {__broken,     __broken,     __broken,     __broken, __broken,    __all_true,
            ^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1783:13: error: use of '_ReduceType' with tag type that does not match previous declaration
            struct _ReduceType
            ^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1770:18: note: previous use is here
            enum _ReduceType
                 ^
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1802:54: error: use of undeclared identifier '__val1'; did you mean '__value'?
                        return _ReduceType{__broken, __val1.__pos};
                                                     ^~~~~~
                                                     __value
/home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1799:73: note: '__value' declared here
                                                            _ReduceType __value) -> _ReduceType {

On Wed, Oct 21, 2020 at 9:36 AM Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com> wrote:
Hi Christopher,

Please find attached a patch that fixes the mentioned problem with the is_partitioned algorithm. I have not yet promoted it to the LLVM repo, but you can check it now with the OpenMP backend.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Monday, October 5, 2020 3:01 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Great, thanks!

I appreciate it!

On Mon, Oct 5, 2020, 04:22 Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com> wrote:
Hi Christopher,
I’ve double-checked the code of __pattern_is_partitioned (which is based on the parallel reduction pattern). Yes, the binary operation is not commutative, so my hypothesis was right.
Generally speaking, writing the reduction pattern “manually”, without OpenMP’s reducer, is not a good approach because it may not be efficient. Indeed, in your example the second (for) loop combines the results serially, and the std::vector brings additional overhead…

Once more, OpenMP reduction requires a commutative binary operation, and that requirement is correct. From the PSTL design perspective, an algorithm pattern should not rely on the parallel reduction pattern (provided by a parallel backend) supporting a non-commutative binary operation. So this is an issue in __pattern_is_partitioned, and we will fix it. For now, I would suggest not worrying about it.
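
To illustrate with a minimal sketch (not the actual PSTL code; the state names only mimic the ones used by __pattern_is_partitioned): each chunk of the input is summarized by a small state, and combining two summaries depends on which chunk comes first, so a reduction that is free to reorder the partial results can change the answer.

#include <cassert>

enum class _State
{
    __all_true,   // every element in the chunk satisfies the predicate
    __all_false,  // no element in the chunk satisfies the predicate
    __true_false, // a run of "true" elements followed by a run of "false" ones
    __broken      // the chunk itself is not partitioned
};

// Combine the summary of a chunk with the summary of the chunk immediately
// to its right. Only the cases needed for the demonstration are spelled out;
// the remaining combinations are collapsed to __broken for brevity.
_State
__combine(_State __left, _State __right)
{
    if (__left == _State::__all_true && __right == _State::__all_false)
        return _State::__true_false; // "true" then "false" is still partitioned...
    if (__left == _State::__all_false && __right == _State::__all_true)
        return _State::__broken;     // ...but "false" then "true" is not.
    if (__left == __right && __left != _State::__true_false)
        return __left;
    return _State::__broken;
}

int main()
{
    // Swapping the operands changes the result, so __combine is not
    // commutative, which breaks the assumption behind an OpenMP
    // 'reduction' clause.
    assert(__combine(_State::__all_true, _State::__all_false) == _State::__true_false);
    assert(__combine(_State::__all_false, _State::__all_true) == _State::__broken);
    return 0;
}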

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Saturday, October 3, 2020 6:19 PM
To: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
Cc: Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: Re: [libcxx-dev] OpenMP parallel reduce bugs

Hello again,

I was able to rewrite the parallel_reduce function in a way that works without using OpenMP's reducer. I have a couple of questions:

1. I use a vector to gather the intermediate results for later reduction. Is there any problem with depending on vector here?
2. I can see that it might make sense to build a taskloop for the actual reduction if the number of chunks is quite large. Is that something I should look into more? (A rough sketch of one possibility follows after the code below.)

The code is below. Please let me know if you have any questions or concerns.


//------------------------------------------------------------------------
// parallel_reduce
//------------------------------------------------------------------------

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    std::vector<_Value> __values(__n_chunks);

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop shared(__values))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __values[__chunk] = __real_body(__begin, __end, __identity);
    }

    auto __result = __values.front();
    for (auto p = __values.begin() + 1; p != __values.end(); ++p)
    {
        __result = __reduction(__result, *p);
    }

    return __result;
}
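
Regarding question 2 above: if __n_chunks were large, one way to parallelize the final combine without requiring commutativity might be to merge adjacent partial results in passes, so operands are always combined in their original left-to-right order and only associativity is needed. A rough, untested sketch (reusing the names from the function above):

template <typename _Value, typename _Reduction>
_Value
__combine_in_order(std::vector<_Value>& __values, _Reduction __reduction)
{
    const std::size_t __n = __values.size();

    // After the pass for a given __stride, __values[__i] (for __i a multiple
    // of 2 * __stride) holds the in-order combination of the original results
    // in [__i, min(__i + 2 * __stride, __n)).
    for (std::size_t __stride = 1; __stride < __n; __stride *= 2)
    {
        _PSTL_PRAGMA(omp taskloop shared(__values))
        for (std::size_t __i = 0; __i < __n - __stride; __i += 2 * __stride)
        {
            __values[__i] = __reduction(__values[__i], __values[__i + __stride]);
        }
    }

    return __values.front();
}

Whether the task overhead pays off would of course depend on how large __n_chunks actually gets.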

On Fri, Oct 2, 2020 at 1:33 PM Christopher Nelson <nadiasvertex at gmail.com> wrote:
Thank you. I was wondering whether you had an update on this. I've done some further looking, and I think that is correct. I've tried to find example implementations of OpenMP reductions that don't require a commutative operator. It seems like rewriting the is_partitioned algorithm to provide a commutative operator might be a larger / undesirable change.

Do you have any guidance on manually writing a task loop in OpenMP that performs the reduction without requiring commutativity?

Thanks!

-={C}=-

On Thu, Oct 1, 2020 at 9:11 AM Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com> wrote:
Hi Christopher,

Yes, the “is_partitioned” algorithm implementation is based on a parallel reduction pattern,
and it looks like the binary operation (the combiner) is not commutative.

In general, a “reduction” algorithm requires a commutative binary operation, and OpenMP reduction requires that.
For the TBB backend it works because TBB’s parallel reduction algorithm does not require a commutative binary operation.

We (Evgeniy or I) will check that hypothesis and let you know.

Best regards,
Mikhail Dvorskiy

From: Christopher Nelson <nadiasvertex at gmail.com>
Sent: Thursday, October 1, 2020 2:46 AM
To: Kukanov, Alexey <Alexey.Kukanov at intel.com>
Cc: Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>; Pavlov, Evgeniy <evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
Subject: [libcxx-dev] OpenMP parallel reduce bugs

Hello friends,

I have been working on the OpenMP backend for the parallel STL, and most of the tests are passing. However, among the failures is the "is_partitioned" test. I have rewritten the __parallel_reduce backend function to be simpler, in an attempt to understand what is failing (code is below).

I also rewrote it as a serial function that splits the iteration range in two and then calls __reduction() on each half of the range being passed in. The result I get from the serial execution differs from the result I get from the parallel execution.
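
The serial check was roughly the following (a sketch with the same parameter names as the function below, not the exact code I ran):

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__serial_reduce_check(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                      _RealBody __real_body, _Reduction __reduction)
{
    // Split the range in two, evaluate each half independently, and then
    // combine the two partial results in left-to-right order.
    auto __mid = __first + (__last - __first) / 2;
    _Value __left = __real_body(__first, __mid, __identity);
    _Value __right = __real_body(__mid, __last, __identity);
    return __reduction(__left, __right);
}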

I have verified that the parallel tasks do run, and that their individual results match what each chunk would produce if I ran it serially.

I am wondering whether there is something wrong with the way OpenMP is running the reducer here. Perhaps it is injecting a value into the computation that this algorithm does not expect? Does anything jump out at anyone as suspicious?

Thank you again for your time and assistance!

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __item_count = __last - __first;
    std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;

    // We should encapsulate a result value and a reduction operator since we
    // cannot use a lambda in OpenMP UDR.
    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
    {
        auto __begin = __first + __i;
        auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}