[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC

Stefanos Baziotis via llvm-dev llvm-dev at lists.llvm.org
Thu Oct 1 15:28:15 PDT 2020


Oh, and one more important thing:

> ... on top of manually interchanging the loops, which gives a ~3x speedup).


Indeed, on my machine too. I understand that this speedup is not the main
point here, but it is worth noting what GCC does with the loops
interchanged:

Clang:
- Original: 2.04
- Interchanged: 0.60

G++:
- Original: 0.43
- Interchanged: 0.15

So GCC gets a ~3x speedup too, and GCC's interchanged version is still 4x
faster than Clang's interchanged version.
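
To make "interchanged" concrete, here is a minimal sketch of the two loop orders. It is not the code from the video or the Godbolt link: `Elem`, `kRuns`, `init_original`, and `init_interchanged` are made-up names, and the element layout is an assumption. Only the 25 `run` iterations come from this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element type; the real benchmark's struct is not reproduced here.
struct Elem {
    float v[4];
};

constexpr int kRuns = 25;  // the thread mentions 25 iterations of `run`

// Original order: `run` is the outer loop, so every run sweeps the whole array.
void init_original(std::vector<Elem>& data) {
    for (int run = 0; run < kRuns; ++run)
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = Elem{{1.0f, 2.0f, 3.0f, 4.0f}};
}

// Interchanged order: each element is written kRuns times in a row.
// The inner loop stores the same value on every iteration, so a compiler
// is free to collapse it into a single store.
void init_interchanged(std::vector<Elem>& data) {
    for (std::size_t i = 0; i < data.size(); ++i)
        for (int run = 0; run < kRuns; ++run)
            data[i] = Elem{{1.0f, 2.0f, 3.0f, 4.0f}};
}
```

Both functions leave the array in the same state; only the order of the stores differs, which is why the memory access pattern (and the timing) changes so much.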

Best,
Stefanos

On Fri, Oct 2, 2020 at 1:16 AM, Stefanos Baziotis <
stefanos.baziotis at gmail.com> wrote:

> Hi Florian,
>
> Thanks for providing feedback.
>
>> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the
>> issue.
>
>
> Great, thanks. I wasn't sure whether it's considered a bug, a regression,
> or something else.
>
>> There’s also an issue with SROA which splits a nice single consecutive
>> llvm.memcpy into 3 separate ones.
>
>
> Yes, exactly. As I said in my first message, SROA seems to be a big problem
> here.
>
>> While the code for the initialization is not ideal, it appears the main
>> issue causing the slowdown is the fact that GCC interchanges the main
>> loops, but LLVM does not. After interchanging, the memory access patterns
>> are completely different
>
>
> Well, that's definitely one factor, and without interchange cache locality
> plays a big role, but I think two things should be noted:
> 1) GCC, even though it interchanges the original code, still runs 25
> iterations, so it still more or less does the benchmark.
> 2) Most importantly, even if we interchange the loops ourselves, Clang's
> codegen is still very bad, both compared to GCC and in general. Also, GCC
> does the obvious thing of completely removing the inner loop (now the one
> with IV `run`), while Clang doesn't. Basically, these are problems that I
> feel loop optimizations should not suffer from at this point :/
>
>> Alternatively, if we would create vector stores instead of the small
>> memcpy calls, we probably would get a better result overall. Using Clang's
>> Matrix Types extensions effectively does so, and with that version
>> https://godbolt.org/z/nvq86W
>
>
> Oh, that's pretty neat. For this code I don't think we should expect the
> user to have to write it like that to get good codegen, but it's still
> cool.
>
> Cheers,
> Stefanos
>
> On Thu, Oct 1, 2020 at 11:59 PM, Florian Hahn <
> florian_hahn at apple.com> wrote:
>
>>
>>
>> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hi everyone,
>>
>> I was watching this video [1]. There's an example of an initialization
>> loop for which
>> Clang unfortunately generates really bad code [2]. On my machine, the
>> Clang version
>> is 4x slower than the GCC version. I have not tested the MSVC version,
>> but it should
>> be around the same.
>>
>> In case anyone's interested, in the video [1] Casey explains why this
>> code is bad (around 59:39).
>>
>> So, I tried to run -print-after-all [3]. There are a lot of passes that
>> interact here, so I was
>> wondering if anyone knows more about that. It seems to me that the
>> problem starts
>> with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are
>> handled down
>> the pipeline. Finally, the regalloc probably did not go very well.
>>
>>
>>
>> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the
>> issue.
>>
>> While the code for the initialization is not ideal, it appears the main
>> issue causing the slowdown is the fact that GCC interchanges the main
>> loops, but LLVM does not. After interchanging, the memory access patterns
>> are completely different (and it also probably slightly defeats the purpose
>> of the benchmark).
>>
>> There’s also an issue with SROA which splits a nice single consecutive
>> llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x
>> speedup (on top of manually interchanging the loops, which gives a ~3x
>> speedup).
>>
>>
>> Alternatively, if we would create vector stores instead of the small
>> memcpy calls, we probably would get a better result overall. Using Clang's
>> Matrix Types extensions effectively does so, and with that version
>> https://godbolt.org/z/nvq86W I get the same speed as if disabling SROA
>> (although the code is not as nice as it could be right now, as there's no
>> syntax for constant initializers for matrix types yet)
>>
>