<div dir="ltr">Hi Florian,<br><br>Thanks for providing feedback.<br><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I filed <a href="https://bugs.llvm.org/show_bug.cgi?id=47705" rel="noreferrer" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=47705</a> to keep track of the issue.</blockquote><div><br></div><div>Great, thanks. I wasn't sure whether this counts as a bug, a regression, or something else.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">There’s also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones.</blockquote><div><br></div><div>Yes, exactly. As I said in my first message, SROA seems to be a big part of the problem here. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different</blockquote><div><br>Well, that's definitely one factor, and without interchange cache locality plays a big role, but I think two things should be noted:<br>1) Even though GCC interchanges the loops in the original code, it still runs 25 iterations, so it still more or less performs the benchmark.</div><div>2) Most importantly, even if we interchange the loops ourselves, Clang's codegen is still very bad, both compared to GCC and in absolute terms. Also, GCC does the obvious thing and completely removes the inner loop (after interchange, the one with IV `run`), while Clang doesn't.</div><div>
Basically, these are problems that I feel loop optimizations should not suffer from at this point :/<br><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Alternatively, if we would create vector stores instead of the small memcpy calls, we probably would get a better result overall. Using Clang's Matrix Types extensions effectively does so, and with that version <a href="https://godbolt.org/z/nvq86W" target="_blank">https://godbolt.org/z/nvq86W</a></blockquote><div><br></div><div>Oh, that's pretty neat. For this code, I don't think we should expect users to have to write it like that to get good codegen, but it's still cool.</div><div><br></div><div>Cheers,<br>Stefanos </div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Oct 1, 2020 at 11:59 PM, Florian Hahn <<a href="mailto:florian_hahn@apple.com">florian_hahn@apple.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><br><div><br><blockquote type="cite"><div>On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:</div><br><div><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">Hi,</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><blockquote type="cite" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none">On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br><br>Hi everyone,<br><br>I was watching this video [1]. There's an example of an initialization loop for which<br>Clang unfortunately generates really bad code [2]. On my machine, the Clang version<br>is 4x slower than the GCC version. I have not tested the MSVC version, but it should<br>be around the same.<br><br>In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).<br><br>So, I tried running with -print-after-all [3]. There are a lot of passes that interact here, so I was<br>wondering if anyone knows more about that. It seems to me that the problem starts<br>with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are handled down<br>the pipeline. 
Finally, the regalloc probably did not go very well.<br></blockquote><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">I filed<span> </span></span><a href="https://bugs.llvm.org/show_bug.cgi?id=47705" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=47705</a><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline"><span> </span>to keep track of the issue.</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (and it also probably slightly defeats the purpose of the benchmark).</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none;float:none;display:inline">There’s also an issue with SROA, which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant-caps:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration:none"></div></blockquote></div><br><div>Alternatively, if we would create vector stores instead of the small memcpy calls, we probably would get a better result overall. 
Using Clang's Matrix Types extensions effectively does so, and with that version <a href="https://godbolt.org/z/nvq86W" target="_blank">https://godbolt.org/z/nvq86W</a> I get the same speed as with SROA disabled (although the code is not as nice as it could be right now, as there's no syntax for constant initializers for matrix types yet).</div></div></blockquote></div>