[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC

Florian Hahn via llvm-dev llvm-dev at lists.llvm.org
Thu Oct 1 13:59:02 PDT 2020



> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi,
> 
>> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> Hi everyone,
>> 
>> I was watching this video [1]. There's an example of an initialization loop for which
>> Clang unfortunately generates really bad code [2]. On my machine, the Clang version
>> is 4x slower than the GCC version. I have not tested the MSVC version, but it should
>> be around the same.
>> 
>> In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).
>> 
>> So, I tried to run -print-after-all [3]. There are a lot of passes that interact here, so I was
>> wondering if anyone knows more about that. It seems to me that the problem starts
>> with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are handled down
>> the pipeline. Finally, the regalloc probably did not go very well.
> 
> 
> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.
> 
> While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (and it also probably slightly defeats the purpose of the benchmark).
> 
> There’s also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).
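
For anyone not familiar with loop interchange, here is a minimal sketch of the transformation mentioned above. It is not the benchmark from the video; the `Grid` type and both function names are made up for illustration. Both functions write the same values, only the order of the memory accesses differs, which is why interchanging can change the timing so much.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D grid stored row-major in one flat buffer
// (not the data structure from the benchmark).
struct Grid {
    std::size_t rows, cols;
    std::vector<float> data;
    Grid(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
};

// Column-outer order: consecutive inner iterations touch addresses
// that are `cols` floats apart, so the accesses are strided.
void fill_strided(Grid &g, float v) {
    for (std::size_t c = 0; c < g.cols; ++c)
        for (std::size_t r = 0; r < g.rows; ++r)
            g.data[r * g.cols + c] = v;
}

// Interchanged loops: the inner loop now walks contiguous memory.
void fill_interchanged(Grid &g, float v) {
    for (std::size_t r = 0; r < g.rows; ++r)
        for (std::size_t c = 0; c < g.cols; ++c)
            g.data[r * g.cols + c] = v;
}
```

A compiler that performs the interchange effectively turns the first form into the second, so a benchmark written to measure the strided pattern ends up measuring something else.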

Alternatively, if we created vector stores instead of the small memcpy calls, we would probably get a better result overall. Using Clang's matrix types extension effectively does so, and with that version (https://godbolt.org/z/nvq86W) I get the same speed as with SROA disabled, although the code is not as nice as it could be right now, as there's no syntax for constant initializers for matrix types yet.
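
To give a rough idea of what using the matrix types extension looks like, here is a small sketch. It is not the code behind the godbolt link; `m4x4` and `make_identity` are names I made up, and the extension is only available in Clang when compiling with -fenable-matrix. The fallback struct is just there so the snippet also builds with other compilers or without the flag. Because there is no constant initializer syntax, the elements are assigned one by one.

```cpp
#include <cstddef>

// Detect Clang's matrix types extension (enabled with -fenable-matrix).
#if defined(__clang__)
#  if __has_extension(matrix_types)
#    define HAVE_MATRIX_TYPES 1
#  endif
#endif

#ifdef HAVE_MATRIX_TYPES
// 4x4 float matrix using the Clang extension.
typedef float m4x4 __attribute__((matrix_type(4, 4)));
#else
// Plain fallback with the same m[row][col] access syntax (assumption:
// only used so the sketch compiles without the extension).
struct m4x4 {
    float e[4][4];
    float *operator[](std::size_t r) { return e[r]; }
};
#endif

// Hypothetical helper: builds an identity matrix element by element,
// since matrix types have no constant initializer syntax.
m4x4 make_identity() {
    m4x4 m;
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            m[r][c] = (r == c) ? 1.0f : 0.0f;
    return m;
}
```

With the extension enabled, the whole matrix is a single vector-like value, so storing it should lower to vector stores rather than several small memcpy calls; I have not checked the generated code for this particular sketch.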