<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Inefficient code generated for NEON function computing GNU symbol hash"

   href="https://bugs.llvm.org/show_bug.cgi?id=43810">43810</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Inefficient code generated for NEON function computing GNU symbol hash

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: ARM

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>rprichard@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, peter.smith@linaro.org, Ties.Stuij@arm.com

          </td>

        </tr></table>

      <p>

        <div>

<pre>Created <span class=""><a href="attachment.cgi?id=22728" name="attach_22728" title="Archive of GNU hash function implementations and build/run scripts">attachment 22728</a> <a href="attachment.cgi?id=22728&action=edit" title="Archive of GNU hash function implementations and build/run scripts">[details]</a></span>

Archive of GNU hash function implementations and build/run scripts

I wrote a NEON-optimized version of a function that computes the GNU hash value

for a symbol name, and Clang's version of the function is slower than what GCC

generates (or what I can do with hand-written assembly).

I'm not quite sure what LLVM is doing that's making it slower. I did notice

that my hand-written assembly doesn't create a stack frame, whereas both GCC

and Clang need one.

Details:

I'm working on making the Bionic dynamic linker's GNU hash calculation faster,

because it takes a significant portion of the total linker run-time. (At one

point, I measured it taking 20% of the total run-time doing the initial linking

of cameraserver.)

The linker currently uses a simple function to calculate the hash.

uint32_t SymbolName::gnu_hash() {

  if (!has_gnu_hash_) {

    uint32_t h = 5381;

    const uint8_t* name = reinterpret_cast<const uint8_t*>(name_);

    while (*name != 0) {

      h += (h << 5) + *name++; // h*33 + c = h + h*32 + c = h + (h << 5) + c

    }

    gnu_hash_ = h;

    has_gnu_hash_ = true;

  }

  return gnu_hash_;

}
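
For reference, the same recurrence can be pulled out into a freestanding
function and spot-checked. This is only a sketch: gnu_hash_simple and
check_gnu_hash_simple are hypothetical names, not part of Bionic, and the
caching members (has_gnu_hash_, gnu_hash_) are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical freestanding helper (not Bionic code): the same
// h = h * 33 + c recurrence as SymbolName::gnu_hash(), without caching.
uint32_t gnu_hash_simple(const char* name) {
  uint32_t h = 5381;
  for (; *name != '\0'; ++name) {
    // Convert through uint8_t so bytes >= 0x80 are not sign-extended.
    h += (h << 5) + static_cast<uint8_t>(*name);
  }
  return h;
}

// Spot checks: the empty string returns the seed, and "printf" exercises
// the wrap-around modulo 2^32.
void check_gnu_hash_simple() {
  assert(gnu_hash_simple("") == 5381u);
  assert(gnu_hash_simple("printf") == 0x156b2bb8u);
}
```

The unsigned arithmetic wraps modulo 2^32, so overflow on long names is
well defined.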

Using hand-written arm32 Neon assembly, I wrote a function that takes 30-50%

less time than the simple C++ version. Using C++ code with Neon intrinsics

instead, I can write something that's still faster than the simple C++ version,

but it gets only about half the improvement when I compile with Clang. GCC, on the other

hand, gets much closer to my hand-written assembly.
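
The attached Neon code is not reproduced here. As a rough, portable sketch of
why the hash can be computed several bytes at a time, the recurrence can be
expanded over a block of four bytes:
h' = h*33^4 + c0*33^3 + c1*33^2 + c2*33 + c3 (mod 2^32). The four products
do not depend on each other, which is the kind of property a SIMD version can
use. Whether the attached implementation uses exactly this block size is an
assumption; gnu_hash_bytewise, gnu_hash_blocked and check_gnu_hash_blocked
are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Reference: one byte per iteration, h = h * 33 + c.
uint32_t gnu_hash_bytewise(const uint8_t* s) {
  uint32_t h = 5381;
  while (*s != 0) h = h * 33 + *s++;
  return h;
}

// Sketch (not the attached code): consume four bytes per iteration using
//   h' = h*33^4 + c0*33^3 + c1*33^2 + c2*33 + c3   (mod 2^32).
uint32_t gnu_hash_blocked(const uint8_t* s) {
  constexpr uint32_t k1 = 33, k2 = k1 * 33, k3 = k2 * 33, k4 = k3 * 33;
  uint32_t h = 5381;
  // Test the bytes in order so nothing past the terminator is read.
  while (s[0] != 0 && s[1] != 0 && s[2] != 0 && s[3] != 0) {
    h = h * k4 + s[0] * k3 + s[1] * k2 + s[2] * k1 + s[3];
    s += 4;
  }
  // Tail: fewer than four bytes remain before the terminator.
  while (*s != 0) h = h * 33 + *s++;
  return h;
}

// The blocked version must agree with the bytewise one for every length.
void check_gnu_hash_blocked() {
  const char* names[] = {"", "a", "ab", "abc", "abcd", "abcde", "printf"};
  for (const char* n : names) {
    const uint8_t* s = reinterpret_cast<const uint8_t*>(n);
    assert(gnu_hash_bytewise(s) == gnu_hash_blocked(s));
  }
}
```

The scalar form above only demonstrates the identity; any speedup would come
from evaluating the independent products in vector lanes.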

Here are some numbers on an arm32-only Go phone. I used the "performance"

scaling governor. I used the <a href="https://tratt.net/laurie/src/multitime">https://tratt.net/laurie/src/multitime</a> utility to

run benchmarks repeatedly and calculate confidence intervals.

Clang, simple C function: 0.441+/-0.0001 (in seconds of wall clock time)

GCC, simple C function: 0.376+/-0.0001

Clang, using Neon intrinsics: 0.373+/-0.0001 (Clang ignored pragma unroll)

GCC, using Neon intrinsics: 0.330+/-0.0001 (w/ no pragma GCC unroll)

GCC, using Neon intrinsics: 0.312+/-0.0003 (w/ pragma GCC unroll 8)

Handwritten assembly: 0.311+/-0.0001

I also looked at a walleye Pixel 2 device (core 4, one of the fast ones). For

arm32:

Clang, simple C function: 0.347+/-0.0023

GCC, simple C function: 0.323+/-0.0021

Clang, using Neon intrinsics: 0.225+/-0.0013

GCC, using Neon intrinsics: 0.208+/-0.0013 (w/ no pragma GCC unroll)

GCC, using Neon intrinsics: 0.186+/-0.0007 (w/ pragma GCC unroll 8)

Handwritten assembly: 0.176+/-0.0013

I don't have handwritten assembly for arm64, but I benchmarked the C++ code:

Clang, simple C function: 0.308+/-0.0017

GCC, simple C function: 0.285+/-0.0018

Clang, using Neon intrinsics: 0.205+/-0.0016 (Clang ignored pragma unroll)

GCC, using Neon intrinsics: 0.189+/-0.0010 (w/ no pragma GCC unroll)

GCC, using Neon intrinsics: 0.217+/-0.0015 (w/ pragma GCC unroll 4)

GCC, using Neon intrinsics: 0.214+/-0.0004 (w/ pragma GCC unroll 8)
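
To put the arm32 timings above on a common scale, the reduction relative to
the "Clang, simple C function" baseline can be recomputed with a small helper.
This is only a convenience sketch; percent_less_time and
check_percent_less_time are hypothetical names, and the inputs are the
wall-clock means listed above.

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in wall-clock time relative to a baseline.
double percent_less_time(double baseline_s, double candidate_s) {
  return 100.0 * (baseline_s - candidate_s) / baseline_s;
}

// Sanity checks: no change gives 0%, halving the time gives 50%.
void check_percent_less_time() {
  assert(std::fabs(percent_less_time(1.0, 1.0)) < 1e-12);
  assert(std::fabs(percent_less_time(2.0, 1.0) - 50.0) < 1e-12);
}
```

With the numbers above, against the Clang simple C function:
Go phone: Clang with Neon intrinsics is about 15.4% less time (0.441 to 0.373),
handwritten assembly about 29.5% (0.441 to 0.311).
Pixel 2, arm32: Clang with Neon intrinsics is about 35.2% less time (0.347 to
0.225), handwritten assembly about 49.3% (0.347 to 0.176).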

I attached a tarball with the source code, Makefile, and a couple of scripts

for running the benchmarks via adb.

I also uploaded three assembly files:

 - my hand-crafted arm32 assembly

 - the output from NDK r21 beta 1's compiler (Clang as of r365631)

 - the output from arm-linux-gnueabi-gcc-8 8.3.0 from my gLinux machine</pre>

        </div>

      </p>


    </body>

</html>