<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Thanks for checking. :)</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

What matters most is where in the app the run time is spent (i.e. which code paths are hot), and here that must be the vectorised loop, which processes 16 elements in parallel using SIMD instructions. Comparing GCC O2 with Clang O3 means comparing an efficient

 scalar loop with a vectorised loop, and we see a ~5x improvement (8.12 s vs 1.63 s). That's pretty decent, but it is some way off a theoretical 16x speed-up, and there could be several reasons for that: not all of the time is spent in the kernel, since some goes to the

 setup code before the loop is entered; the vectorised codegen may not be efficient enough; or the loop may be waiting for data. I haven't looked into the details (the code and experimental setup), so this is a bit hand-wavy, but I guess this is the gist of it.</div>
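<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

To put a rough number on the gap between ~5x and 16x, here is a small sketch. It assumes Amdahl's law as the model, which is my assumption and not something measured in this thread: a fraction p of the scalar run time gets the ideal 16x, and the rest is not sped up at all. The helper names (speedup, amdahl_speedup, vector_fraction) are made up for illustration.</div>

```c
#include <assert.h>

/* Measured speedup: how many times faster the new build is than the baseline. */
double speedup(double baseline_seconds, double new_seconds)
{
    assert(baseline_seconds > 0.0 && new_seconds > 0.0);
    return baseline_seconds / new_seconds;
}

/* Amdahl's law (an assumed model): overall speedup when a fraction p of the
   scalar run time is accelerated by a factor of `lanes` and the rest is unchanged. */
double amdahl_speedup(double p, double lanes)
{
    assert(p >= 0.0 && p <= 1.0 && lanes >= 1.0);
    return 1.0 / ((1.0 - p) + p / lanes);
}

/* Inverse of the above: the fraction p that would explain an observed overall
   speedup s, given an ideal per-loop speedup of `lanes`. Only meaningful
   for 1 <= s <= lanes. */
double vector_fraction(double s, double lanes)
{
    assert(s >= 1.0 && lanes > 1.0 && s <= lanes);
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / lanes);
}
```

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Feeding in the GCC O2 and Clang O3 timings from the quoted mail, speedup(8.115871, 1.629502) comes out at roughly 4.98, and vector_fraction of that with 16 lanes is roughly 0.85. Under this simplified model, about 15% of the scalar run time not benefiting from vectorisation would already be enough to explain the result. It cannot tell setup overhead apart from inefficient codegen or memory stalls, so a profile is still needed for that.</div>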

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Cheers.<br>

</div>

<div id="appendonsend"></div>

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Dennis Luehring &lt;dl.soluz@gmx.net&gt;<br>

<b>Sent:</b> 04 December 2021 06:27<br>

<b>To:</b> Sjoerd Meijer &lt;Sjoerd.Meijer@arm.com&gt;; cfe-dev &lt;cfe-dev@lists.llvm.org&gt;<br>

<b>Subject:</b> Re: [cfe-dev] clang generates way more code than - Optimizer bug?</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">On 03.12.2021 at 14:58, Sjoerd Meijer wrote:<br>

> I guess the only way to tell is to run and measure it.<br>

<br>

<br>

Yeah, the clang code is faster than the gcc code, even with the much larger code size,<br>

as you said :)<br>

<br>

thank you<br>

<br>

<br>

gcc O1: 9.782262 seconds<br>

gcc O2: 8.115871 seconds<br>

gcc O3: 3.092142 seconds<br>

<br>

clang O1: 9.905967 seconds<br>

clang O2: 1.629295 seconds<br>

clang O3: 1.629502 seconds<br>

<br>

<br>

benchmark-code:<br>

<br>

<br>

#include &lt;stdio.h&gt;<br>

#include &lt;time.h&gt;<br>

<br>

void decipher(unsigned char* text_, int text_len_)<br>

{<br>

   for (int i = text_len_ - 1; i >= 0; --i)<br>

   {<br>

     text_[i] ^= 0xff - (i << 2);<br>

   }<br>

}<br>

<br>

int benchmark()<br>

{<br>

   unsigned char text[] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,<br>

0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF };<br>

<br>

   int anti_optimizer = 0;<br>

   for (int i = 0; i < 1000000000; ++i)<br>

   {<br>

     decipher(text, sizeof(text));<br>

<br>

     for (int x = 0; x < sizeof(text); ++x)<br>

     {<br>

       anti_optimizer += text[x];<br>

     }<br>

   }<br>

   return anti_optimizer;<br>

}<br>

<br>

int main()<br>

{<br>

   clock_t start_time = clock();<br>

   int result = benchmark();<br>

   double elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;<br>

   printf("Done in %f seconds\n", elapsed_time);<br>

   return result;<br>

}<br>

<br>

<br>

<br>

<br>

</div>

</span></font></div>

</body>

</html>