[LLVMdev] unaligned AVX store gets split into two instructions

Demikhovsky, Elena elena.demikhovsky at intel.com
Wed Jul 10 00:50:55 PDT 2013


Send me a pointer to the code, I'll check performance for our workloads.

- Elena

From: Nadav Rotem [mailto:nrotem at apple.com]
Sent: Wednesday, July 10, 2013 08:15
To: Eli Friedman
Cc: Zach Devito; LLVM Developers Mailing List; Demikhovsky, Elena
Subject: Re: [LLVMdev] unaligned AVX store gets split into two instructions

Hi,

Yes. On Sandy Bridge, 256-bit loads and stores are double-pumped: they go in one after the other, over two cycles.  On Haswell the memory ports are wide enough to allow a 256-bit memory operation in one cycle.  So, on Sandy Bridge we split unaligned memory operations into two 128-bit parts so that they can execute on two separate ports. This is also what GCC and ICC do.
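
For readers who want to see the two forms side by side, here is a minimal sketch in C with AVX intrinsics. It is only an illustration of the split described above, not the code LLVM generates or the lowering logic itself. The function names `copy8_whole` and `copy8_split` and the `USE_AVX` macro are hypothetical, and the `target("avx")` attribute assumes a GCC- or Clang-compatible compiler on x86.

```c
#include <immintrin.h>  /* AVX intrinsics; x86 only */

/* GCC/Clang-specific attribute (an assumption about the toolchain) so the
   functions below can use AVX without passing -mavx for the whole file. */
#define USE_AVX __attribute__((target("avx")))

/* Copy 8 floats with one unaligned 256-bit load and one unaligned
   256-bit store (normally a single vmovups on a ymm register each). */
USE_AVX void copy8_whole(const float *src, float *dst) {
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, v);
}

/* Same copy, but the store is split by hand into two 128-bit halves:
   the low half is stored directly, the high half is extracted first. */
USE_AVX void copy8_split(const float *src, float *dst) {
    __m256 v = _mm256_loadu_ps(src);
    _mm_storeu_ps(dst, _mm256_castps256_ps128(v));       /* elements 0..3 */
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1)); /* elements 4..7 */
}
```

Both functions write the same eight values; only the instruction mix differs. The split version pays for an extra extract, which is the cost mentioned in the next paragraph. Note that the compiler may still merge or re-split these operations depending on the target CPU it is tuning for.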

It is very possible that the decision to split the wide vectors causes a regression.  If the memory ports are busy, it is better to double-pump them and save the cost of the insert/extract subvector.  Unfortunately, during ISel we don't have a good way to estimate port pressure. In any case, it is a good idea to revisit the heuristics that I put in and to see whether they match the Sandy Bridge optimization guide. If I remember correctly, the optimization guide does not have much information on this, but Elena looked it over and said that it made sense.

BTW, you can validate that this is the problem using the IACA tool. It performs static analysis on your binary and tells you where the critical path is.  http://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Thanks,
Nadav


On Jul 9, 2013, at 10:01 PM, Eli Friedman <eli.friedman at gmail.com> wrote:


On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote:

I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
on AVX.
3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as a
single instruction (details below).
In a matrix-matrix inner-kernel, I see a ~25% decrease in performance, which
seems to be due to this.

Any ideas why this changed? Thanks!

This was intentional; apparently doing it with two instructions is
supposed to be faster.  See r172868/r172894.

Adding Nadav in case he has anything more to say.

-Eli
