<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Clang is not aware of a false dependency of POPCNT on destination register on Intel Skylake CPU"
href="https://bugs.llvm.org/show_bug.cgi?id=33869">33869</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Clang is not aware of a false dependency of POPCNT on destination register on Intel Skylake CPU
</td>
</tr>
<tr>
<th>Product</th>
<td>clang
</td>
</tr>
<tr>
<th>Version</th>
<td>4.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>C++
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedclangbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>me@adhokshajmishraonline.in
</td>
</tr>
<tr>
<th>CC</th>
<td>dgregor@apple.com, llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=18826" name="attach_18826" title="Test source code, dumped assembler source code, and LLVM IR code">attachment 18826</a> <a href="attachment.cgi?id=18826&action=edit" title="Test source code, dumped assembler source code, and LLVM IR code">[details]</a></span>
Test source code, dumped assembler source code, and LLVM IR code
The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency on
its destination register: although POPCNT only writes the destination, it does
not execute until the previous value of that register is ready. This costs
performance when the destination was written shortly before the POPCNT. The
same behaviour is present on Sandy Bridge, Ivy Bridge and Haswell processors.
G++ seems to be aware of the dependency and mostly avoids generating code that
triggers it. Clang-generated code, however, hits the false dependency again
and again.
Platform Details
================
CPU: Intel(R) Core(TM) i7-6700HQ CPU
OS: Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
clang version 4.0.1 (tags/RELEASE_401/final)
Test Code
=========
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;

    uint64_t size = 10<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    // Fill the buffer with pseudo-random bytes
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration =
            chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        // Bytes per nanosecond is the same as GB/s
        cout << "Counter\t" << count << "\nSpeed\t" <<
            (10000.0*size)/(duration) << " GB/s" << endl;
    }
    // The buffer was allocated with new[], so release it with delete[], not free()
    delete[] buffer;
}
Code Generated by Clang
=======================
Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14
[code stripped]
popcnt rcx, qword ptr [r15 + 8*rax]
add rcx, rbx
popcnt rdx, qword ptr [r15 + 8*rax + 8]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 16]
add rcx, rdx
popcnt rdx, qword ptr [r15 + 8*rax + 24]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 32]
add rcx, rdx
popcnt rdx, qword ptr [r15 + 8*rax + 40]
add rdx, rcx
popcnt rcx, qword ptr [r15 + 8*rax + 48]
add rcx, rdx
popcnt rbx, qword ptr [r15 + 8*rax + 56]
[code stripped]
In the code generated above, most POPCNT instructions write a register (rcx or
rdx) that a preceding ADD has just written. POPCNT only writes its destination,
but due to the false dependency it does not execute until that earlier value is
ready, so the POPCNT instructions end up serialized behind the ADD chain.
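The reasoning above can be sketched as a small check over an instruction
listing. This is only an illustrative model, not how either compiler works:
the Insn struct and count_exposed_popcnt are hypothetical names, registers are
plain 64-bit names, and a zeroing idiom is assumed to be "xor r, r" on the
same register. It only sees writes inside the listing it is given, so on a
stripped excerpt the result is at best a lower bound.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified instruction record (not a real decoder).
struct Insn {
    std::string op;   // mnemonic, e.g. "popcnt", "add", "xor"
    std::string dst;  // destination register, 64-bit name
    std::string src;  // source register, or "mem" for a memory operand
};

// Counts POPCNT instructions whose destination was most recently written,
// within the listing, by something other than a zeroing idiom.
std::size_t count_exposed_popcnt(const std::vector<Insn>& code) {
    // reg -> true if the last write seen was NOT a zeroing idiom
    std::unordered_map<std::string, bool> pending;
    std::size_t exposed = 0;
    for (const Insn& i : code) {
        if (i.op == "popcnt") {
            auto it = pending.find(i.dst);
            if (it != pending.end() && it->second) ++exposed;
        }
        const bool zeroing = (i.op == "xor" && i.dst == i.src);
        pending[i.dst] = !zeroing;
    }
    return exposed;
}
```

Feeding it a popcnt/add pattern that reuses rcx and rdx flags the later
POPCNTs, while the same pattern with "xor rcx, rcx" before each POPCNT flags
none.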
Code Generated by GCC
=====================
Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14
[code stripped]
xor eax, eax
xor ecx, ecx
popcnt rax, QWORD PTR [rdx]
popcnt rcx, QWORD PTR 8[rdx]
add rax, rcx
xor ecx, ecx
popcnt rcx, QWORD PTR 16[rdx]
add rdx, 32
add rax, rcx
xor ecx, ecx
popcnt rcx, QWORD PTR -8[rdx]
add rax, rcx
add r12, rax
cmp rdx, r13
[code stripped]
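In the code generated by GCC, the destination register is zeroed with xor
before POPCNT, which breaks the dependency. The false dependency is triggered
in only 2 cases (in the Clang code it is 7), resulting in better performance.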
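The GCC listing above clears ecx with xor before each popcnt into rcx. As a
source-level illustration of the same idiom, here is a minimal sketch of a
wrapper that zeroes the destination before POPCNT. The name popcnt_zeroed_dest
is hypothetical and not part of the attached test code; the sketch assumes
GCC or Clang extended inline assembly in the default AT&T syntax on x86-64 and
a CPU with POPCNT, and falls back to a portable bit-counting loop elsewhere.

```cpp
#include <cstdint>

// Hypothetical helper: POPCNT with the destination zeroed first.
inline std::uint64_t popcnt_zeroed_dest(std::uint64_t x) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    std::uint64_t r;
    // xor on the 32-bit half zeroes the full 64-bit register and is a
    // recognized dependency-breaking idiom; POPCNT then writes into it.
    // "=&r" marks r as early-clobber so it never shares a register with x.
    __asm__("xorl %k0, %k0\n\t"
            "popcntq %1, %0"
            : "=&r"(r)
            : "r"(x)
            : "cc");
    return r;
#else
    // Portable fallback: clear the lowest set bit until none remain.
    std::uint64_t n = 0;
    for (; x != 0; x &= x - 1) ++n;
    return n;
#endif
}
```

In the test loop this would stand in for _mm_popcnt_u64(buffer[i]). Whether it
helps depends on how the compiler schedules the surrounding code, so it is a
workaround to measure rather than a fix.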
The test code, dumped assembly code (as emitted by the compilers), and LLVM IR
code are attached herewith (in a ZIP file).</pre>
</div>
</p>
</body>
</html>