[LLVMbugs] [Bug 15986] New: Stack arrays (C-style array and std::array) sometimes compile to ~50% slower code than std::vector for array access

Mon May 13 14:54:09 PDT 2013

http://llvm.org/bugs/show_bug.cgi?id=15986

            Bug ID: 15986
           Summary: Stack arrays (C-style array and std::array) sometimes
                    compile to ~50% slower code than std::vector for array
                    access
           Product: clang
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: LLVM Codegen
          Assignee: unassignedclangbugs at nondot.org
          Reporter: mschamer at zoho.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

Created attachment 10507
  --> http://llvm.org/bugs/attachment.cgi?id=10507&action=edit
Test case:  Comment/uncomment typedefs in main() function to test array vs.
vector.

I'm writing a complementary multiply-with carry random number generator that
can be configured with templates to use an std::vector or std::array (or
C-style array) for its internal state.

Strangely, Clang/LLVM generates significantly slower code when stack-based
arrays are used as a drop-in replacement for vectors.  GCC 4.8 generates
performance-comparable code for each, and Clang's vector-based code gives
comparable speed to GCC.  I'm not familiar enough with assembly to know whether
Clang is accessing stack arrays in a fundamentally suboptimal way when
surrounded by certain computations, or if the stack arrays are merely
triggering different code generation for the surrounding computations, but the
resulting assembly has a number of differences, and it's quite a bit worse with
stack arrays.

Compilation line:
    clang++ -Wall -Wextra -std=c++11 -msse4.1 -O3 SlowerArrayExample.cpp

Test line:
    time ./a.out

Array Timing:
    real    0m2.875s
    user    0m2.833s
    sys    0m0.002s

Vector Timing:
    real    0m1.994s
    user    0m1.969s
    sys    0m0.002s

I've controlled for alignment/cache issues by using alignas(128) with the stack
arrays, and I've controlled for initialization caching issues by
zero-initializing in both cases.  My original test program used clock_gettime()
around the test loop only, so the issue is related to the array accesses
(either reads or writes) in the test loop, not array/vector construction or
initialization.

This is pretty unintuitive:  I'd expect array/vector access to be equivalent. 
If there's any performance difference at all, I'd expect it to work in slight
favor of stack arrays, since the start address of a stack array can be directly
calculated based on the stack pointer, whereas the start address of a vector
needs to be loaded from a member pointer.

I've attached the most minimal problem case I can find, but I've tried several
things to further isolate the "triggers" for this behavior, and the following
information may be helpful:

WHAT SOURCE CHANGES EQUALIZE VECTOR/ARRAY IMPLEMENTATIONS?
The slowdown with stack arrays disappears if I remove all of the computations
and turn rand() into something silly like this:
    result_type rand()
    {
        //  Not semantically valid:
        const size_t next_circular_index = (circular_index_ == (r - 1)) ? 0 :
circular_index_ + 1;
        result_type result = state_[circular_index_] + a;
        state_[circular_index_] = result;
        circular_index_ = next_circular_index;
        return result;
    }

In other words, the slow stack array accesses only emerge in the presence of
some minimal amount of surrounding computation.  Also, the array and vector
versions perform equally [slowly] if I replace the ternary operator (in the
original code) with the following:
        const size_t next_circular_index = (circular_index_ + 1) % r;

Using a full modulus operation makes the generator run a full three times
slower than the original vector version, but the array/vector versions do
technically perform equally in that case.  However, the array slowdown does not
seem specifically related to the branch in the ternary operator, because arrays
are still slower than vectors if I replace the line with the following (valid
only for power-of-2 r's):
        const size_t next_circular_index = (circular_index_ + 1) & (r - 1);

WHAT SOURCE CHANGES SEEM IRRELEVANT?
The slowdown doesn't seem to be a result of suboptimal source code, because
it's pretty much unaffected by source-level micro-optimization.

That is to say, I've tried all of the following in an attempt to debug this (in
addition to what I mentioned above):
- Saving every member/array read to a temporary local variable and rearranging
the code lines into literally 600 different semantically valid permutations
- Manually replacing division and modulus with shift and "and" operations for
base 2^32 (and something a bit more complicated for base 2^32 - 1)
- Manually replacing multiplication with shift-and-add/subtract-shifted for
relevant multiplier 'a' values (multipliers that are powers of 2 away from
powers of 2)

Some of these source-level tweaks affect the overall speed of the generator,
but the stack version always incurs a slowdown compared to the vector version. 
The slowdown doesn't seem dependent on the specific presence of multiply or
division operators in the source.  However, it could still be dependent on
shift instructions, "and" instructions, branches, or combinations thereof,
since the arithmetically gutted code above doesn't reproduce the problem, and
neither does the [slower] code using a modulus operation to update the circular
buffer index.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20130513/4dc9a493/attachment.html>