<div dir="ltr">Oh also another thing that's important:<br><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">on top of manually interchanging the loops, which gives a ~3x speedup).</blockquote><div><br></div><div>Indeed in my machine too. Although I understand that this speedup is not the important thing here, it has to be noted what<br>GCC does with them interchanged:<br><br>Clang:<br>- Original: 2.04<br>- Interchanged: 0.60<br><br>G++:<br>- Original: 0.43<br>- Interchanged: 0.15<br><br>So, GCC gets a 3x speedup too and also the interchanged version of GCC is still 4x faster.<br><br>Best,<br>Stefanos</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Στις Παρ, 2 Οκτ 2020 στις 1:16 π.μ., ο/η Stefanos Baziotis <<a href="mailto:stefanos.baziotis@gmail.com">stefanos.baziotis@gmail.com</a>> έγραψε:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Florian,<br><br>Thanks for providing feedback.<br><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I filed <a href="https://bugs.llvm.org/show_bug.cgi?id=47705" rel="noreferrer" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=47705</a> to keep track of the issue.</blockquote><div><br></div><div>Great thanks I wasn't sure if it's considered a bug / regression / whatever.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">There’s also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones.</blockquote><div><br></div><div>Yes exactly, as I said in my first message SROA seems to be a big problem here. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different</blockquote><div><br>Well, definitely that's one thing and without interchange cache locality plays a big role but I think two things should be noted:<br>1) GCC, even though it interchanges the original code, still runs 25 iterations, so it still kind of does the benchmark.</div><div>2) Most importantly, even if we interchange the code ourselves, Clang's codegen is still very bad both compared to GCC and in general. Also, GCC does the obvious thing of completely removing the inner loop (now that with IV `run`), while</div><div>Clang doesn't. Basically, these are problems that I feel loop optimizations should not suffer from at this point :/<br><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Alternatively, if we we would create vector stores instead of the small memcpy calls, we probably would get a better result overall. Using Clang's Matrix Types extensions effectively does so, and with that version <a href="https://godbolt.org/z/nvq86W" target="_blank">https://godbolt.org/z/nvq86W</a></blockquote><div><br></div><div>Oh that's pretty neat. I mean for this code I don't think we should expect the user to have to write the code like that to get good codegen but it's still cool.</div><div><br></div><div>Cheers,<br>Stefanos </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Στις Πέμ, 1 Οκτ 2020 στις 11:59 μ.μ., ο/η Florian Hahn <<a href="mailto:florian_hahn@apple.com" target="_blank">florian_hahn@apple.com</a>> έγραψε:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br><div><br><blockquote type="cite"><div>On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:</div><br><div><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">Hi,</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><blockquote type="cite" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br><br>Hi everyone,<br><br>I was watching this video [1]. There's an example of an initialization loop for which<br>Clang unfortunately generates really bad code [2]. In my machine, the Clang version<br>is 4x slower than the GCC version. I have not tested the MSVC version, but it should<br>be around the same.<br><br>In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).<br><br>So, I tried to run -print-after-all [3]. There are a lot of passes that interact here, so I was<br>wondering if anyone knows more about that. It seems to me that the problem starts<br>with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are handled down<br>the pipeline. Finally, the regalloc probably did not go very well.<br></blockquote><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">I filed<span> </span></span><a href="https://bugs.llvm.org/show_bug.cgi?id=47705" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=47705</a><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline"><span> </span>to keep track of the issue.</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (and it also probably slightly defeats the purpose of the benchmark).</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">There’s also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"></div></blockquote></div><br><div>Alternatively, if we we would create vector stores instead of the small memcpy calls, we probably would get a better result overall. Using Clang's Matrix Types extensions effectively does so, and with that version <a href="https://godbolt.org/z/nvq86W" target="_blank">https://godbolt.org/z/nvq86W</a> I get the same speed as if disabling SROA (although the code is not as nice as it code be right now, as there's no syntax for constant initializers for matrix types yet)</div></div></blockquote></div>

</blockquote></div>