[llvm-bugs] [Bug 33869] New: Clang is not aware of a false dependency of POPCNT on destination register on Intel Skylake CPU

via llvm-bugs llvm-bugs at lists.llvm.org
Thu Jul 20 19:12:47 PDT 2017


https://bugs.llvm.org/show_bug.cgi?id=33869

            Bug ID: 33869
           Summary: Clang is not aware of a false dependency of POPCNT on
                    destination register on Intel Skylake CPU
           Product: clang
           Version: 4.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: C++
          Assignee: unassignedclangbugs at nondot.org
          Reporter: me at adhokshajmishraonline.in
                CC: dgregor at apple.com, llvm-bugs at lists.llvm.org

Created attachment 18826
  --> https://bugs.llvm.org/attachment.cgi?id=18826&action=edit
Test source code, dumped assembler source code, and LLVM IR code

The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency
on its destination register, resulting in a performance loss if the destination
register is used immediately after POPCNT. The same issue is present on Sandy
Bridge, Ivy Bridge and Haswell processors as well.

G++ seems to be aware of the dependency and does not generate code in which
the false dependency is triggered. Clang-generated code, however, hits the
false dependency again and again.

Platform Details
================

CPU:       Intel(R) Core(TM) i7-6700HQ CPU
OS:        Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
           clang version 4.0.1 (tags/RELEASE_401/final)

Test Code
=========

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;

    uint64_t size = 10<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Counter\t"  << count << "\nSpeed\t" << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;  // allocated with new[], so free() is not the matching call
}


Code Generated by Clang
=======================

Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14

[code stripped]
        popcnt  rcx, qword ptr [r15 + 8*rax]
        add     rcx, rbx
        popcnt  rdx, qword ptr [r15 + 8*rax + 8]
        add     rdx, rcx
        popcnt  rcx, qword ptr [r15 + 8*rax + 16]
        add     rcx, rdx
        popcnt  rdx, qword ptr [r15 + 8*rax + 24]
        add     rdx, rcx
        popcnt  rcx, qword ptr [r15 + 8*rax + 32]
        add     rcx, rdx
        popcnt  rdx, qword ptr [r15 + 8*rax + 40]
        add     rdx, rcx
        popcnt  rcx, qword ptr [r15 + 8*rax + 48]
        add     rcx, rdx
        popcnt  rbx, qword ptr [r15 + 8*rax + 56]
[code stripped]

In the code generated above, the destination register of each POPCNT is used by
the next instruction, and POPCNT itself only writes to that register. Due to the
false dependency, however, a POPCNT does not execute until the previous value of
its destination register is ready to be read, even though it only writes to it.

Code Generated by GCC
=====================

Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14

[code stripped]
        xor     eax, eax
        xor     ecx, ecx
        popcnt  rax, QWORD PTR [rdx]
        popcnt  rcx, QWORD PTR 8[rdx]
        add     rax, rcx
        xor     ecx, ecx
        popcnt  rcx, QWORD PTR 16[rdx]
        add     rdx, 32
        add     rax, rcx
        xor     ecx, ecx
        popcnt  rcx, QWORD PTR -8[rdx]
        add     rax, rcx
        add     r12, rax
        cmp     rdx, r13
[code stripped]

In the code generated by GCC, the false dependency is triggered in only 2 cases
(versus 7 in the Clang output), resulting in better performance.

The test code, the assembly dumped from each compiler, and the LLVM IR are
attached (as a ZIP archive).
