[libcxx-dev] is_partitioned algorithm

Christopher Nelson via libcxx-dev libcxx-dev at lists.llvm.org
Thu Oct 1 05:26:11 PDT 2020


Hi friends,

I have rewritten the __parallel_reduce function for OpenMP to avoid using
the reduction() clause. I think that OpenMP is not feeding the
output values to the reduction in the way that one might expect; I am
tracking that problem in a different thread. In any case, running the loop
without the reduction clause and then performing an explicit reduction at
the end gets _much_ farther.
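For concreteness, the shape of that workaround is roughly the following (a
minimal sketch, not the actual patch; the name __parallel_reduce_sketch and
the fixed chunk count of 4 are mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: parallelize the per-chunk work with "#pragma omp parallel for"
// but skip the reduction() clause. Each chunk writes its partial result
// into its own slot, and the final combine runs sequentially afterwards.
// (Compiled without -fopenmp, the pragma is ignored and this runs
// serially with the same result.)
template <class _Value, class _Brick, class _Combine>
_Value __parallel_reduce_sketch(std::size_t __n, _Value __identity,
                                _Brick __brick, _Combine __combine)
{
    const std::size_t __chunks = 4; // arbitrary for the sketch
    const std::size_t __grain = (__n + __chunks - 1) / __chunks;
    std::vector<_Value> __partial(__chunks, __identity);

#pragma omp parallel for
    for (long __i = 0; __i < (long)__chunks; ++__i)
    {
        const std::size_t __first = (std::size_t)__i * __grain;
        const std::size_t __last = std::min(__n, __first + __grain);
        if (__first < __last)
            __partial[__i] = __brick(__first, __last); // per-chunk result
    }

    // Explicit, ordered reduction over the partial results.
    _Value __result = __identity;
    for (const _Value& __v : __partial)
        __result = __combine(__result, __v);
    return __result;
}
```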

In this thread, I would like to talk about the is_partitioned algorithm.
Briefly, I can see that there is a main body which performs some processing
and then returns a result that is a value in a lookup table.

The brick body for is_partitioned will return one of:

enum _ReduceType
{
    __not_init = -1, // reduction value not yet initialized

    __broken,        // 0: range is not partitioned

    __all_true,      // 1: only elements satisfying the predicate
    __all_false,     // 2: only elements failing the predicate
    __true_false     // 3: [true..., false...]
};

In the case I am trying to debug there are 521 elements. The simple, naive
chunk partitioner I wrote breaks this up into two chunks: 0-511 and
512-521.
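That split is consistent with a fixed-grain partitioner along these lines
(a sketch of what I mean, not the actual code; I am reading the ranges as
half-open, so "512-521" is [512, 521)):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Naive fixed-grain partitioner: split [0, n) into half-open chunks of
// at most __grain elements each.
std::vector<std::pair<std::size_t, std::size_t>>
__simple_chunks(std::size_t __n, std::size_t __grain)
{
    std::vector<std::pair<std::size_t, std::size_t>> __chunks;
    for (std::size_t __first = 0; __first < __n; __first += __grain)
        __chunks.emplace_back(__first, std::min(__n, __first + __grain));
    return __chunks;
}
```

With n = 521 and a grain of 512 this produces exactly the two chunks
[0, 512) and [512, 521).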

The first chunk comes back as "2" (__all_false) and the second as "3"
(__true_false), if I am reading the enum right (with __not_init = -1,
__broken is 0, __all_true 1, __all_false 2, and __true_false 3). The
reducer is called with __reduction(2, 3), which performs the following
table lookup:

new_state = table[left_state * 4 + right_state]

In this case, 2 * 4 + 3 gives table entry 11:

_ReduceType __table[] = {
    //                      right: __broken  __all_true  __all_false   __true_false
    /* left __broken     */ __broken,  __broken,   __broken,     __broken,
    /* left __all_true   */ __broken,  __all_true, __true_false, __true_false,
    /* left __all_false  */ __broken,  __broken,   __all_false,  __broken,
    /* left __true_false */ __broken,  __broken,   __true_false, __broken
};
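The reduction step can be reproduced standalone (a sketch: the enum and
table are as above, but the __not_init handling is my assumption about how
the identity value is treated):

```cpp
// Enum and lookup table as in the is_partitioned brick discussed above.
enum _ReduceType
{
    __not_init = -1, // reduction value not yet initialized
    __broken,        // 0
    __all_true,      // 1
    __all_false,     // 2
    __true_false     // 3
};

static const _ReduceType __table[] = {
    __broken, __broken,   __broken,     __broken,
    __broken, __all_true, __true_false, __true_false,
    __broken, __broken,   __all_false,  __broken,
    __broken, __broken,   __true_false, __broken
};

// Combine two chunk states. Treating __not_init as the identity is my
// assumption; the table lookup itself matches the code above.
_ReduceType __reduction(_ReduceType __left, _ReduceType __right)
{
    if (__left == __not_init)
        return __right;
    return __table[__left * 4 + __right];
}
```

Note the asymmetry: __reduction(__all_true, __all_false) is table entry 6,
__true_false (a valid partition), while __reduction(__all_false,
__true_false) is table entry 11, __broken, since a false chunk followed by
a true element cannot be partitioned.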

Which turns out to be __broken. However, when I run the brick body on the
entire range at once, it returns "3" (__true_false). This seems to indicate
that the algorithm is very sensitive to how the range is chunked. I looked
at the TBB implementation, and it seems to create a blocked_range with a
grainsize of 1. That seems wrong, since the scheduling overhead would be
enormous, so clearly there is something going on that I don't understand.

However, since reducing two arbitrarily sized chunks of the range appears
to give an answer different from reducing a symmetric split, it seems like
an even split may be a requirement. Is that true? Does __parallel_reduce
need to split the range evenly in order to perform correctly?
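One way to probe this question locally is to classify the same partitioned
input under several uneven splits and compare the reduced results (a
self-contained sketch; __classify is my own stand-in written from the
apparent meaning of the enum, not the libc++ brick, and treating __not_init
as the identity is my assumption):

```cpp
#include <cstddef>
#include <vector>

enum _ReduceType
{
    __not_init = -1,
    __broken,      // 0
    __all_true,    // 1
    __all_false,   // 2
    __true_false   // 3
};

static const _ReduceType __table[] = {
    __broken, __broken,   __broken,     __broken,
    __broken, __all_true, __true_false, __true_false,
    __broken, __broken,   __all_false,  __broken,
    __broken, __broken,   __true_false, __broken
};

// Stand-in brick: classify __v[__first, __last) where the "predicate" is
// simply the element value.
_ReduceType __classify(const std::vector<bool>& __v,
                       std::size_t __first, std::size_t __last)
{
    std::size_t __i = __first;
    while (__i < __last && __v[__i]) ++__i;   // skip leading trues
    std::size_t __j = __i;
    while (__j < __last && !__v[__j]) ++__j;  // skip following falses
    if (__j != __last) return __broken;       // a true appears after a false
    if (__i == __first) return __all_false;   // no trues at all
    if (__j == __i) return __all_true;        // no falses at all
    return __true_false;
}

// Reduce chunk classifications left-to-right with the table.
_ReduceType __reduce_chunks(const std::vector<bool>& __v,
                            const std::vector<std::size_t>& __bounds)
{
    _ReduceType __state = __not_init;
    for (std::size_t __k = 0; __k + 1 < __bounds.size(); ++__k)
    {
        const _ReduceType __r = __classify(__v, __bounds[__k], __bounds[__k + 1]);
        __state = (__state == __not_init) ? __r : __table[__state * 4 + __r];
    }
    return __state;
}
```

On a genuinely partitioned input, both the whole-range classification and
several uneven splits reduce to __true_false here, which suggests the table
reduction itself does not require an even split as long as each chunk is
classified correctly.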

-={C}=-

