[llvm-bugs] [Bug 33869] New: Clang is not aware of a false dependency of POPCNT on destination register on Intel Skylake CPU
via llvm-bugs
llvm-bugs at lists.llvm.org
Thu Jul 20 19:12:47 PDT 2017
https://bugs.llvm.org/show_bug.cgi?id=33869
Bug ID: 33869
Summary: Clang is not aware of a false dependency of POPCNT on
destination register on Intel Skylake CPU
Product: clang
Version: 4.0
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: C++
Assignee: unassignedclangbugs at nondot.org
Reporter: me at adhokshajmishraonline.in
CC: dgregor at apple.com, llvm-bugs at lists.llvm.org
Created attachment 18826
--> https://bugs.llvm.org/attachment.cgi?id=18826&action=edit
Test source code, dumped assembler source code, and LLVM IR code
The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency
on its destination register, resulting in a performance loss if the destination
register is used immediately after POPCNT. The same bug has been present in
Sandy Bridge, Ivy Bridge and Haswell processors as well.
G++ seems to be aware of the dependency and avoids generating code that
triggers it. Clang-generated code, however, is hit by the false dependency
again and again.
Platform Details
================
CPU: Intel(R) Core(TM) i7-6700HQ CPU
OS: Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
clang version 4.0.1 (tags/RELEASE_401/final)
Test Code
=========
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    uint64_t size = 10<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    // Fill the buffer with random bytes
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;
    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration =
            chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        // bytes per nanosecond is the same as GB/s
        cout << "Counter\t" << count << "\nSpeed\t" <<
            (10000.0*size)/(duration) << " GB/s" << endl;
    }
    // Allocated with new[], so release with delete[] rather than free()
    delete[] buffer;
}
Code Generated by Clang
=======================
Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14
[code stripped]
popcnt rcx, qword ptr [r15 + 8*rax]
add rcx, rbx
popcnt rdx, qword ptr [r15 + 8*rax + 8]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 16]
add rcx, rdx
popcnt rdx, qword ptr [r15 + 8*rax + 24]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 32]
add rcx, rdx
popcnt rdx, qword ptr [r15 + 8*rax + 40]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 48]
add rcx, rdx
popcnt rbx, qword ptr [r15 + 8*rax + 56]
[code stripped]
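In the generated code above, each POPCNT writes to a register (rcx or rdx) that
an earlier ADD in the chain has written. POPCNT only writes its destination
register, but due to the false dependency it does not execute until the
previous value of that register is ready, so the POPCNTs are serialized behind
the ADD chain instead of running in parallel.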
Code Generated by GCC
=====================
Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14
[code stripped]
xor eax, eax
xor ecx, ecx
popcnt rax, QWORD PTR [rdx]
popcnt rcx, QWORD PTR 8[rdx]
add rax, rcx
xor ecx, ecx
popcnt rcx, QWORD PTR 16[rdx]
add rdx, 32
add rax, rcx
xor ecx, ecx
popcnt rcx, QWORD PTR -8[rdx]
add rax, rcx
add r12, rax
cmp rdx, r13
[code stripped]
In the code generated by GCC, the false dependency is triggered in only 2 cases
(compared with 7 in the clang output), resulting in faster code.
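The GCC listing above zeroes the destination register with XOR before POPCNT.
A minimal sketch of forcing the same pattern from source is given here, assuming
GNU-style inline assembly in the default AT&T syntax on x86-64. The helper names
popcnt_zeroed_dst and count_bits are hypothetical and are not part of the
attached test code; whether this recovers the lost performance with clang has
not been measured here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the bug report): count the set bits in x,
// zeroing the destination register first, as the GCC output above does.
static inline std::uint64_t popcnt_zeroed_dst(std::uint64_t x) {
#if defined(__x86_64__)
    std::uint64_t result;
    // "=&r" (early clobber) keeps result and x in different registers,
    // because result is zeroed before x is read.
    __asm__("xorl %k0, %k0\n\t"   // zero the destination (breaks the dependency)
            "popcntq %1, %0"      // result = number of set bits in x
            : "=&r"(result)
            : "r"(x)
            : "cc");
    return result;
#else
    // Fallback for other targets: compiler builtin, no special handling.
    return static_cast<std::uint64_t>(__builtin_popcountll(x));
#endif
}

// Hypothetical helper: sum the bit counts of n 64-bit words.
std::uint64_t count_bits(const std::uint64_t* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += popcnt_zeroed_dst(data[i]);
    return total;
}
```

In the test program, the four _mm_popcnt_u64 calls in the inner loop could be
replaced by calls to such a helper. This only works around the issue at the
source level; it does not make the compiler itself aware of the dependency.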
The test code, dumped assembly code (dumped from the compiler), and LLVM IR code
are attached (as a ZIP archive).