[LLVMdev] unaligned AVX store gets split into two instructions

Dmitry Babokin babokin at gmail.com
Thu Sep 19 10:40:31 PDT 2013


Update: the problem seems to be fixed by r190916.


On Thu, Sep 19, 2013 at 8:19 PM, Dmitry Babokin <babokin at gmail.com> wrote:

> Nadav,
>
> We see multiple regressions after r172868 in the ISPC compiler (based on the
> LLVM optimizer). The regressions are due to spills/reloads, which are caused
> by increased register pressure. This matches Zach's analysis. We've filed bug
> 17285 for this problem.
>
> Is there any way to avoid the splitting when multiple loads occur
> together?
>
> Dmitry.
>
>
> On Wed, Jul 10, 2013 at 1:12 PM, Zach Devito <zdevito at stanford.edu> wrote:
>
>> I've narrowed this down to a single kernel (kernel.ll), which does a
>> fixed-size matrix-matrix multiply:
>>
>> # ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s
>> # ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s
>> # ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32
>> # ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33
>> # time ./harness32
>> real 0m0.584s
>> user 0m0.581s
>> sys 0m0.001s
>> # time ./harness33
>> real 0m0.730s
>> user 0m0.725s
>> sys 0m0.001s
>>
>> If you look at kernel33.s, it has a register spill/reload in the inner
>> loop. This doesn't appear in the llvm 3.2 version and disappears from the
>> 3.3 version if you remove the "align 8"s from kernel.ll which are making it
>> unaligned.  Do the two-instruction unaligned loads increase register
>> pressure? Or is something else going on?
>>
>> Zach
>>
>> On Tue, Jul 9, 2013 at 11:33 PM, Zach Devito <zdevito at stanford.edu> wrote:
>>
>>> Thanks for all the info! I'm still in the process of narrowing down
>>> the performance difference in my code. I'm no longer convinced it's related
>>> only to the unaligned loads/stores, since extracting this part of the
>>> kernel makes the performance difference disappear.  I will try to narrow
>>> down what is going on and if it seems related to LLVM, I will post an example.
>>> Thanks again,
>>>
>>> Zach
>>>
>>>
>>> On Tue, Jul 9, 2013 at 10:15 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes. On Sandybridge 256-bit loads/stores are double pumped.  This means
>>>> that they go in one after the other in two cycles.  On Haswell the memory
>>>> ports are wide enough to allow a 256-bit memory operation in one cycle.  So,
>>>> on Sandybridge we split unaligned memory operations into two 128-bit parts
>>>> to allow them to execute on two separate ports. This is also what GCC and
>>>> ICC do.
>>>>
>>>> It is very possible that the decision to split the wide vectors causes
>>>> a regression.  If the memory ports are busy it is better to double-pump
>>>> them and save the cost of the insert/extract subvector.  Unfortunately,
>>>> during ISel we don’t have a good way to estimate port pressure. In any
>>>> case, it is a good idea to revise the heuristics that I put in and to see
>>>> if it matches the Sandybridge optimization guide. If I remember correctly
>>>> the optimization guide does not have too much information on this, but
>>>> Elena looked over it and said that it made sense.
>>>>
>>>> BTW, you can validate that this is the problem using the IACA tool. It
>>>> performs static analysis on your binary and tells you where the critical
>>>> path is.
>>>> http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
>>>>
>>>> Thanks,
>>>> Nadav
>>>>
>>>>
>>>> On Jul 9, 2013, at 10:01 PM, Eli Friedman <eli.friedman at gmail.com>
>>>> wrote:
>>>>
>>>> On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote:
>>>>
>>>> I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector
>>>> loads on AVX.
>>>> 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted
>>>> as a single instruction (details below).
>>>> In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
>>>> which seems to be due to this.
>>>>
>>>> Any ideas why this changed? Thanks!
>>>>
>>>>
>>>> This was intentional; apparently doing it with two instructions is
>>>> supposed to be faster.  See r172868/r172894.
>>>>
>>>> Adding Nadav in case he has anything more to say.
>>>>
>>>> -Eli
>>>>
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>
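
For readers following the thread, the two code shapes under discussion (one
unaligned 256-bit load versus two 128-bit halves joined by an insert) can be
sketched at the source level with AVX intrinsics. This is only an
illustration: the function names load_single and load_split are made up
here, the target("avx") attribute assumes GCC or Clang, and the instructions
a given compiler version actually emits for either function may differ.

```c
#include <immintrin.h>

/* Shape 1: a single unaligned 256-bit load, then an unaligned store.
 * load_single is a hypothetical name used only for this sketch. */
__attribute__((target("avx")))
void load_single(const float *src, float *dst)
{
    __m256 v = _mm256_loadu_ps(src);      /* eight floats in one load */
    _mm256_storeu_ps(dst, v);
}

/* Shape 2: the same eight floats loaded as two 128-bit halves and
 * combined with an insert into the upper lane.
 * load_split is a hypothetical name used only for this sketch. */
__attribute__((target("avx")))
void load_split(const float *src, float *dst)
{
    __m128 lo = _mm_loadu_ps(src);        /* floats 0..3 */
    __m128 hi = _mm_loadu_ps(src + 4);    /* floats 4..7 */
    __m256 v  = _mm256_castps128_ps256(lo);
    v = _mm256_insertf128_ps(v, hi, 1);   /* place hi in the upper 128 bits */
    _mm256_storeu_ps(dst, v);
}
```

Both functions produce the same result. The second form carries the extra
insert step that Nadav mentions as the cost of splitting.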