[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC

Thu Oct 1 15:16:17 PDT 2020

Hi Florian,

Thanks for providing feedback.

> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the
> issue.

Great, thanks. I wasn't sure whether it's considered a bug, a regression,
or something else.

> There’s also an issue with SROA which splits a nice single consecutive
> llvm.memcpy into 3 separate ones.

Yes, exactly. As I said in my first message, SROA seems to be a big problem
here.

> While the code for the initialization is not ideal, it appears the main
> issue causing the slowdown is the fact that GCC interchanges the main
> loops, but LLVM does not. After interchanging, the memory access patterns
> are completely different

Well, that's definitely one thing, and without interchange cache locality
plays a big role. But I think two things should be noted:
1) GCC, even though it interchanges the original code, still runs 25
iterations, so it still kind of does the benchmark.
2) Most importantly, even if we interchange the loops ourselves, Clang's
codegen is still very bad, both compared to GCC and in general. Also, GCC
does the obvious thing of completely removing the inner loop (now the one
with IV `run`), while Clang doesn't. Basically, these are problems that I
feel loop optimizations should not suffer from at this point :/
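
To make the interchange point concrete, here is a minimal sketch of the
three loop shapes being discussed. It is not the benchmark from the video:
`Elem`, `kInit`, `kCount` and the function names are hypothetical stand-ins,
and only the run count of 25 and the IV name `run` come from this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type; the real benchmark's type is not shown here.
struct Elem { float v[4]; };

constexpr Elem kInit{{1.0f, 0.0f, 0.0f, 1.0f}};  // assumed initial value
constexpr int kRuns = 25;                        // run count from the thread
constexpr std::size_t kCount = 1024;             // assumed array size

// Original order: each run sweeps the whole array.
void init_original(std::vector<Elem>& a) {
  for (int run = 0; run < kRuns; ++run)
    for (std::size_t i = 0; i < a.size(); ++i)
      a[i] = kInit;
}

// Interchanged: the loop with IV `run` is now the inner one and
// stores the same value to the same element on every iteration.
void init_interchanged(std::vector<Elem>& a) {
  for (std::size_t i = 0; i < a.size(); ++i)
    for (int run = 0; run < kRuns; ++run)
      a[i] = kInit;
}

// What removing that redundant inner loop leaves: one store per element.
void init_collapsed(std::vector<Elem>& a) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = kInit;
}

// Element-wise comparison, used to check that all three variants agree.
bool same(const std::vector<Elem>& x, const std::vector<Elem>& y) {
  if (x.size() != y.size()) return false;
  for (std::size_t i = 0; i < x.size(); ++i)
    for (int k = 0; k < 4; ++k)
      if (x[i].v[k] != y[i].v[k]) return false;
  return true;
}
```

All three variants leave the array in the same state; they differ only in
how many stores are issued and in what order memory is touched.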

> Alternatively, if we would create vector stores instead of the small
> memcpy calls, we probably would get a better result overall. Using Clang's
> Matrix Types extensions effectively does so, and with that version
> https://godbolt.org/z/nvq86W

Oh, that's pretty neat. For this code I don't think we should expect the
user to have to write it like that to get good codegen, but it's still
cool.
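
For readers who have not seen the extension, a rough sketch of what
initializing with Clang's Matrix Types can look like follows. This is not
the code behind the godbolt link: `m4x4`, `make_identity`, `fill_identity`
and `element` are hypothetical names, the 4x4 float shape is an assumption,
and the matrix path is only taken when Clang is invoked with
`-fenable-matrix`. Otherwise a plain struct fallback is used so the sketch
still builds.

```cpp
#include <cstddef>

// Detect Clang's matrix extension (only active with -fenable-matrix).
#if defined(__clang__)
#  if __has_extension(matrix_types)
#    define SKETCH_HAVE_MATRIX_TYPES 1
#  endif
#endif

#ifdef SKETCH_HAVE_MATRIX_TYPES
// A 4x4 float matrix as a single first-class value.
typedef float m4x4 __attribute__((matrix_type(4, 4)));
#else
// Fallback for other compilers; offers the same m[r][c] access syntax.
struct m4x4 {
  float e[4][4];
  float* operator[](int r) { return e[r]; }
};
#endif

// No constant-initializer syntax is assumed, so elements are set one by one.
inline m4x4 make_identity() {
  m4x4 m;
  for (int r = 0; r < 4; ++r)
    for (int c = 0; c < 4; ++c)
      m[r][c] = (r == c) ? 1.0f : 0.0f;
  return m;
}

// Each assignment copies a whole matrix value into the destination array.
inline void fill_identity(m4x4* dst, std::size_t n) {
  const m4x4 id = make_identity();
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = id;
}

// Read one element of the i-th matrix (row r, column c).
inline float element(m4x4* a, std::size_t i, int r, int c) {
  return a[i][r][c];
}
```

The idea is that a whole-matrix assignment gives the compiler one value to
store, rather than several small copies to piece back together.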

Cheers,
Stefanos

On Thu, Oct 1, 2020 at 11:59 PM, Florian Hahn <
florian_hahn at apple.com> wrote:

>
>
> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi everyone,
>
> I was watching this video [1]. There's an example of an initialization
> loop for which Clang unfortunately generates really bad code [2]. On my
> machine, the Clang version is 4x slower than the GCC version. I have not
> tested the MSVC version, but it should be around the same.
>
> In case anyone's interested, in the video [1] Casey explains why this code
> is bad (around 59:39).
>
> So, I tried to run -print-after-all [3]. There are a lot of passes that
> interact here, so I was wondering if anyone knows more about that. It
> seems to me that the problem starts with SROA. Also, I'm not familiar with
> how these llvm.memcpy / memset are handled down the pipeline. Finally, the
> regalloc probably did not go very well.
>
>
>
> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the
> issue.
>
> While the code for the initialization is not ideal, it appears the main
> issue causing the slowdown is the fact that GCC interchanges the main
> loops, but LLVM does not. After interchanging, the memory access patterns
> are completely different (and it also probably slightly defeats the purpose
> of the benchmark).
>
> There’s also an issue with SROA which splits a nice single consecutive
> llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x
> speedup (on top of manually interchanging the loops, which gives a ~3x
> speedup).
>
>
> Alternatively, if we would create vector stores instead of the small
> memcpy calls, we probably would get a better result overall. Using Clang's
> Matrix Types extensions effectively does so, and with that version
> https://godbolt.org/z/nvq86W I get the same speed as if disabling SROA
> (although the code is not as nice as it could be right now, as there's no
> syntax for constant initializers for matrix types yet)
>