[libcxx-dev] OpenMP parallel reduce bugs

Christopher Nelson via libcxx-dev libcxx-dev at lists.llvm.org
Sat Oct 31 07:33:17 PDT 2020


Hi Mikhail,

I used the patch, but the unit tests are failing. Maybe I applied the patch
badly, or perhaps I did something wrong in the reducer. I rewrote the
parallel reduce function as:

template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
_Value
__parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
                       _RealBody __real_body, _Reduction __reduction)
{
    std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
    __omp_backend::__chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);

    using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
    _CombinerType __result{__identity, &__reduction};
    _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)

    // To avoid over-subscription we use taskloop for the nested parallelism
    _PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
    for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
    {
        auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
        auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
        auto __begin = __first + __index;
        auto __end = __begin + __this_chunk_size;
        __result.__value = __real_body(__begin, __end, __identity);
    }

    return __result.__value;
}
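
For reference, here is a rough sketch of the helpers the body above relies on
(approximate shapes only; the real definitions live in the PSTL internal
headers and may differ in detail). The point is that the OpenMP user-defined
reduction merges task-private partial results by calling the _Combiner's call
operator, which applies the wrapped __reduction:

// Approximate sketch, for illustration only.
// _Combiner wraps a partial value together with a pointer to the binary
// reduction, because an OpenMP user-defined reduction cannot capture a lambda.
template <typename _Value, typename _Reduction>
struct _Combiner
{
    _Value __value;
    _Reduction* __reduction;

    // Combine step used by the UDR: fold another partial result into ours.
    void
    operator()(const _Combiner& __other)
    {
        __value = (*__reduction)(__value, __other.__value);
    }
};

// _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType) expands to
// roughly:
//   #pragma omp declare reduction(__combiner : _CombinerType : omp_out(omp_in)) \
//       initializer(omp_priv = omp_orig)
// i.e. OpenMP merges two private _CombinerType copies by calling
// omp_out(omp_in) and initializes each private copy from the original.

Since OpenMP may merge those private copies in any order, this path
effectively needs __reduction to be commutative, which is exactly the
is_partitioned issue discussed further down the thread.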


On Sat, Oct 31, 2020 at 9:35 AM Christopher Nelson <nadiasvertex at gmail.com>
wrote:

> I was able to puzzle out the patch, and it seems to compile now. I'll
> rewrite the reducer and see how it goes. Thanks!
>
> On Fri, Oct 30, 2020 at 4:29 PM Christopher Nelson <nadiasvertex at gmail.com>
> wrote:
>
>> Hi Mikhail,
>>
>> I tried to apply the patch today, but it doesn't apply cleanly. May I ask
>> what version of the repo I should use? After I apply it (skipping a hunk
>> that doesn't apply) I get some compile errors:
>>
>>
>> /home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1780:13: error: unknown type name '_ReduceRes'
>>             _ReduceRes __table[] = {__broken,     __broken,     __broken,     __broken, __broken,    __all_true,
>>             ^
>> /home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1783:13: error: use of '_ReduceType' with tag type that does not match previous declaration
>>             struct _ReduceType
>>             ^
>> /home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1770:18: note: previous use is here
>>             enum _ReduceType
>>                  ^
>> /home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1802:54: error: use of undeclared identifier '__val1'; did you mean '__value'?
>>                         return _ReduceType{__broken, __val1.__pos};
>>                                                      ^~~~~~
>>                                                      __value
>> /home/christopher/projects/llvm-project/pstl/include/pstl/internal/algorithm_impl.h:1799:73: note: '__value' declared here
>>                                                             _ReduceType __value) -> _ReduceType {
>> On Wed, Oct 21, 2020 at 9:36 AM Dvorskiy, Mikhail <
>> mikhail.dvorskiy at intel.com> wrote:
>>
>>> Hi Christopher,
>>>
>>>
>>>
>>> Please find attached a patch which fixes the mentioned problem with the
>>> is_partitioned algo. I've not yet promoted it into the LLVM repo, but you
>>> can check it now with the OpenMP backend.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Mikhail Dvorskiy
>>>
>>>
>>>
>>> *From:* Christopher Nelson <nadiasvertex at gmail.com>
>>> *Sent:* Monday, October 5, 2020 3:01 PM
>>> *To:* Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
>>> *Cc:* Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <
>>> evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas
>>> Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
>>> *Subject:* Re: [libcxx-dev] OpenMP parallel reduce bugs
>>>
>>>
>>>
>>> Great, thanks!
>>>
>>>
>>>
>>> I appreciate it!
>>>
>>>
>>>
>>> On Mon, Oct 5, 2020, 04:22 Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
>>> wrote:
>>>
>>> Hi Christopher,
>>>
>>> I've double-checked the code of __pattern_is_partitioned (which is based
>>> on the reduction parallel pattern). Yes, the binary operation is not
>>> commutative, so my hypothesis was right.
>>>
>>> Generally speaking, writing the reduction pattern "manually" without
>>> OpenMP's reducer is not a good approach, because it may not be efficient.
>>> Indeed, if we consider your example, the second loop (for) combines the
>>> results serially, and std::vector brings additional overhead...
>>>
>>>
>>>
>>> Once more, OpenMP reduction requires a commutative binary operation, and
>>> that requirement is legitimate. From the PSTL design perspective, an
>>> algorithm pattern should not rely on the fact that the parallel reduction
>>> pattern (which is provided by a parallel backend) supports a
>>> non-commutative binary operation. So it is an issue of
>>> __pattern_is_partitioned and we will fix it. For now I would suggest not
>>> worrying about it.
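>>>
>>> As a minimal illustration of the commutativity point (a standalone sketch,
>>> not part of the patch): with OpenMP the order in which private partial
>>> results are combined is unspecified, so an associative but non-commutative
>>> operation such as concatenation may not reproduce the left-to-right serial
>>> result.
>>>
>>> #include <iostream>
>>> #include <string>
>>>
>>> // Concatenation is associative but not commutative, so the combine order
>>> // of the per-thread partial strings is observable.
>>> #pragma omp declare reduction(concat : std::string : omp_out += omp_in)
>>>
>>> int main()
>>> {
>>>     std::string s;
>>> #pragma omp parallel for reduction(concat : s)
>>>     for (int i = 0; i < 8; ++i)
>>>         s += std::to_string(i);
>>>     // A serial loop always yields "01234567"; the parallel reduction may
>>>     // combine the per-thread chunks in a different order.
>>>     std::cout << s << "\n";
>>> }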
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Mikhail Dvorskiy
>>>
>>>
>>>
>>> *From:* Christopher Nelson <nadiasvertex at gmail.com>
>>> *Sent:* Saturday, October 3, 2020 6:19 PM
>>> *To:* Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>
>>> *Cc:* Kukanov, Alexey <Alexey.Kukanov at intel.com>; Pavlov, Evgeniy <
>>> evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas
>>> Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
>>> *Subject:* Re: [libcxx-dev] OpenMP parallel reduce bugs
>>>
>>>
>>>
>>> Hello again,
>>>
>>>
>>>
>>> I was able to rewrite the parallel_reduce function in a way that works
>>> without using OpenMP's reducer. I have a couple of questions:
>>>
>>>
>>>
>>> 1. I use a vector to gather the intermediate results for later reduction.
>>> Is there any problem with depending on vector here?
>>>
>>> 2. I can see that it might make sense to build a taskloop for the actual
>>> reduction step if the number of chunks is quite large. Is that something that I
>>> should look into more? (One possible shape for that is sketched after the
>>> code below.)
>>>
>>>
>>>
>>> The code is below. Please let me know if you have any questions or
>>> concerns.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> //------------------------------------------------------------------------
>>> // parallel_reduce
>>> //------------------------------------------------------------------------
>>> template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
>>> _Value
>>> __parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
>>>                        _RealBody __real_body, _Reduction __reduction)
>>> {
>>>     std::size_t __n_chunks{0}, __chunk_size{0}, __first_chunk_size{0};
>>>     __chunk_partitioner(__first, __last, __n_chunks, __chunk_size, __first_chunk_size);
>>>
>>>     std::vector<_Value> __values(__n_chunks);
>>>
>>>     // To avoid over-subscription we use taskloop for the nested parallelism
>>>     _PSTL_PRAGMA(omp taskloop shared(__values))
>>>     for (std::size_t __chunk = 0; __chunk < __n_chunks; ++__chunk)
>>>     {
>>>         auto __this_chunk_size = __chunk == 0 ? __first_chunk_size : __chunk_size;
>>>         auto __index = __chunk == 0 ? 0 : (__chunk * __chunk_size) + (__first_chunk_size - __chunk_size);
>>>         auto __begin = __first + __index;
>>>         auto __end = __begin + __this_chunk_size;
>>>         __values[__chunk] = __real_body(__begin, __end, __identity);
>>>     }
>>>
>>>     auto __result = __values.front();
>>>     for (auto p = __values.begin() + 1; p != __values.end(); ++p)
>>>     {
>>>         __result = __reduction(__result, *p);
>>>     }
>>>
>>>     return __result;
>>> }
>>>
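>>> Regarding question 2 above, one order-preserving way to parallelize the
>>> final combine (a sketch under the assumption that __reduction is
>>> associative but possibly non-commutative; __combine_values is a
>>> hypothetical helper, not part of the code above) is a pairwise tree
>>> reduction over __values:
>>>
>>> template <typename _Value, typename _Reduction>
>>> _Value
>>> __combine_values(std::vector<_Value>& __values, _Reduction __reduction)
>>> {
>>>     const std::size_t __n = __values.size();
>>>     // Each round combines elements that are __stride apart, halving the
>>>     // number of live partial results while keeping left-to-right order,
>>>     // so commutativity is never required.
>>>     for (std::size_t __stride = 1; __stride < __n; __stride *= 2)
>>>     {
>>>         _PSTL_PRAGMA(omp taskloop shared(__values))
>>>         for (std::size_t __i = 0; __i < __n - __stride; __i += 2 * __stride)
>>>             __values[__i] = __reduction(__values[__i], __values[__i + __stride]);
>>>     }
>>>     return __values.front();
>>> }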
>>>
>>>
>>> On Fri, Oct 2, 2020 at 1:33 PM Christopher Nelson <
>>> nadiasvertex at gmail.com> wrote:
>>>
>>> Thank you. I wondered if you had an update on this. I've done some
>>> further looking, and I think that is correct. I've tried to find example
>>> implementations of reductions in OpenMP that don't require a commutative
>>> operator. It seems like rewriting the is_partitioned algorithm to provide
>>> a commutative operator might be a larger / undesirable change.
>>>
>>>
>>>
>>> Do you have any guidance on manually writing a task loop in OpenMP that
>>> performs the reduction without requiring commutativity?
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> -={C}=-
>>>
>>>
>>>
>>> On Thu, Oct 1, 2020 at 9:11 AM Dvorskiy, Mikhail <
>>> mikhail.dvorskiy at intel.com> wrote:
>>>
>>> Hi Christopher,
>>>
>>>
>>>
>>> Yes, the "is_partitioned" algo implementation is based on a reduction
>>> parallel pattern.
>>>
>>> And it looks like the binary operation (combiner) is not commutative.
>>>
>>>
>>>
>>> In general, a "reduction" algorithm requires a commutative binary
>>> operation, and OpenMP reduction requires that.
>>>
>>> For the TBB backend it works because the TBB parallel reduction algorithm
>>> doesn't require a commutative binary operation.
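>>>
>>> For context, a minimal sketch of the TBB functional form (illustrative
>>> only, with concatenation standing in for the real body): TBB only joins
>>> adjacent subranges, left operand before right, which is why an associative
>>> but non-commutative operation is fine there.
>>>
>>> #include <tbb/blocked_range.h>
>>> #include <tbb/parallel_reduce.h>
>>> #include <string>
>>> #include <vector>
>>>
>>> std::string concat_all(const std::vector<std::string>& v)
>>> {
>>>     return tbb::parallel_reduce(
>>>         tbb::blocked_range<std::size_t>(0, v.size()), std::string{},
>>>         // body: reduce one subrange, continuing from a partial result
>>>         [&](const tbb::blocked_range<std::size_t>& r, std::string init) {
>>>             for (std::size_t i = r.begin(); i != r.end(); ++i)
>>>                 init += v[i];
>>>             return init;
>>>         },
>>>         // join: always combines a left partial result with the adjacent
>>>         // right one, so the left-to-right order is preserved
>>>         [](const std::string& left, const std::string& right) { return left + right; });
>>> }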
>>>
>>>
>>>
>>> We (me or Evgeniy) will check that hypothesis and inform you.
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Mikhail Dvorskiy
>>>
>>>
>>>
>>> *From:* Christopher Nelson <nadiasvertex at gmail.com>
>>> *Sent:* Thursday, October 1, 2020 2:46 AM
>>> *To:* Kukanov, Alexey <Alexey.Kukanov at intel.com>
>>> *Cc:* Dvorskiy, Mikhail <mikhail.dvorskiy at intel.com>; Pavlov, Evgeniy <
>>> evgeniy.pavlov at intel.com>; Louis Dionne <ldionne at apple.com>; Thomas
>>> Rodgers <trodgers at redhat.com>; Libc++ Dev <libcxx-dev at lists.llvm.org>
>>> *Subject:* [libcxx-dev] OpenMP parallel reduce bugs
>>>
>>>
>>>
>>> Hello friends,
>>>
>>>
>>>
>>> I have been working on the OpenMP backend for the parallel STL, and most
>>> of the tests are passing. However, among the failures is the
>>> "is_partitioned" test. I have rewritten the __parallel_reduce backend
>>> function to be simpler, in an attempt to understand what is failing (code
>>> is below).
>>>
>>>
>>>
>>> I also rewrote it as a serial function that splits the iteration range
>>> in two and then calls __reduction() on each half of the range being passed
>>> in. The result I get from the serial execution differs from the result I
>>> get from the parallel execution.
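>>>
>>> (One plausible reading of that serial reference, sketched here since the
>>> exact code wasn't posted; __serial_reduce_reference is a hypothetical
>>> name:)
>>>
>>> template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
>>> _Value
>>> __serial_reduce_reference(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
>>>                           _RealBody __real_body, _Reduction __reduction)
>>> {
>>>     // Split the range in two, run the real body on each half, then
>>>     // combine the two partial results once, left before right.
>>>     auto __mid = __first + (__last - __first) / 2;
>>>     _Value __left = __real_body(__first, __mid, __identity);
>>>     _Value __right = __real_body(__mid, __last, __identity);
>>>     return __reduction(__left, __right);
>>> }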
>>>
>>>
>>>
>>> I have verified that the parallel execution tasks are run, and that
>>> their results match what each serial execution would be if I ran them that
>>> way.
>>>
>>>
>>>
>>> I am wondering whether there is something wrong with the way OpenMP is
>>> running the reducer here. Perhaps it is injecting a value into the
>>> computation that is unexpected for this algorithm? Does anything jump out
>>> at anyone as suspicious?
>>>
>>>
>>>
>>> Thank you again for your time and assistance!
>>>
>>> template <class _RandomAccessIterator, class _Value, typename _RealBody, typename _Reduction>
>>> _Value
>>> __parallel_reduce_body(_RandomAccessIterator __first, _RandomAccessIterator __last, _Value __identity,
>>>                        _RealBody __real_body, _Reduction __reduction)
>>> {
>>>     std::size_t __item_count = __last - __first;
>>>     std::size_t __head_items = (__item_count / __default_chunk_size) * __default_chunk_size;
>>>
>>>     // We should encapsulate a result value and a reduction operator since we
>>>     // cannot use a lambda in an OpenMP UDR.
>>>     using _CombinerType = __pstl::__internal::_Combiner<_Value, _Reduction>;
>>>     _CombinerType __result{__identity, &__reduction};
>>>     _PSTL_PRAGMA_DECLARE_REDUCTION(__combiner, _CombinerType)
>>>
>>>     // To avoid over-subscription we use taskloop for the nested parallelism
>>>     //_PSTL_PRAGMA(omp taskloop reduction(__combiner : __result))
>>>     for (std::size_t __i = 0; __i < __item_count; __i += __default_chunk_size)
>>>     {
>>>         auto __begin = __first + __i;
>>>         auto __end = __i < __head_items ? __begin + __default_chunk_size : __last;
>>>         __result.__value = __real_body(__begin, __end, __identity);
>>>     }
>>>
>>>     return __result.__value;
>>> }
>>>
>>>