[llvm-dev] LLVM struct, alloca, SROA and the entry basic block

Wed Sep 9 10:13:40 PDT 2015

On Sep 8, 2015, at 2:11 PM, Benoit Belley via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> You will find assembly sequences such as:
> 
>         movss   dword ptr [rcx - 12], xmm4 # 32-bit store
>         movss   dword ptr [rcx - 16], xmm3 # 32-bit store
>         mov rdx, qword ptr [rcx - 16]      # 64-bit load
> 
> Notice how the stores and loads are back-to-back and of different bit-widths. On my processor (Intel Sandy Bridge), this sequence seems to fail store-forwarding and to cause a huge CPU pipeline stall. Or at least, this is what the following CPU performance counter leads me to believe:
> 
> LD_BLOCKS.STORE_FORWARD: Loads blocked by overlapping with store buffer that cannot be forwarded.

Yep, that happens on all Intel archs: if a load is wider than the store it reads from, store forwarding is blocked. The hardware doesn't support forwarding two stores into one load, even if everything's properly aligned.
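
A minimal C sketch of the access pattern, assuming a little-endian x86-64 target. The names `Vec2`, `load_wide` and `load_narrow` are hypothetical and not taken from the original test case. Whether a compiler actually emits two narrow stores followed by a wide load depends on its version and optimization level; the sketch only shows that both forms compute the same value, so the difference lies in how the hardware can forward the stores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: two adjacent 32-bit floats, 8 bytes in total. */
typedef struct { float x; float y; } Vec2;

static_assert(sizeof(Vec2) == sizeof(uint64_t), "Vec2 must be 8 bytes");

/* Two 32-bit stores followed by one 64-bit load of the same bytes.
   If this pattern reaches the hardware as written, the wide load
   overlaps two separate stores and cannot be forwarded from either. */
uint64_t load_wide(float x, float y) {
    Vec2 v;
    uint64_t bits;
    v.x = x;                         /* 32-bit store */
    v.y = y;                         /* 32-bit store */
    memcpy(&bits, &v, sizeof bits);  /* 64-bit load spanning both stores */
    return bits;
}

/* Same result, but each load matches the width of the store it reads,
   and the halves are combined in registers. */
uint64_t load_narrow(float x, float y) {
    Vec2 v;
    uint32_t lo, hi;
    v.x = x;
    v.y = y;
    memcpy(&lo, &v.x, sizeof lo);    /* 32-bit load of x */
    memcpy(&hi, &v.y, sizeof hi);    /* 32-bit load of y */
    return ((uint64_t)hi << 32) | lo;  /* little-endian layout assumed */
}
```

Comparing the generated assembly for the two functions at different optimization levels is one way to check which form a given compiler version produces.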

> My test case is generating 1,500,000,000 of these « blocked store-forwarding » events when using LLVM 3.7 versus 74,000 for LLVM 3.6! The number of instructions executed per CPU cycle goes down to 0.7 IPC from 2.2 IPC.
> 
> Further analysis suggests that it might be due to the GVN pass (which runs just before the MemCpy pass) which actually combines 2 32-bit loads into a single 64-bit load.  See the attached files.
> 
> I have also noted that the allocas are actually getting properly annotated with an alignment of 8 bytes by the « Combine redundant instructions » pass. So, I guess that annotating the allocas when emitting LLVM IR within our JIT compiler is unnecessary. Is that a fair assessment?
> 
> Is store-forwarding always blocked on this kind of memory access, even if the accesses are properly aligned?
> 
> (Side note: Moving the alloca into the entry BB causes all of these redundant alloca, store and load instructions to be optimized out, and the entire store-forwarding issue goes away for this particular test case. But isn’t this an issue that could be triggered in other valid cases?)

I'd hope it's less likely to be a problem in other situations: if the optimizer is working properly (and not "broken" by the use of alloca outside the entry block), it's less likely to emit a load from an address immediately after a store to the same address -- the compiler should've just used registers in the first place. Store forwarding matters most when the store and load are close together -- the farther apart they are, the less likely it is that the store is still working its way through the pipeline.
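
To make "just used the registers" concrete, here is a hedged C sketch of the same kind of computation with no stack slot involved. The name `pack_in_registers` is hypothetical. Optimizing compilers typically lower a 4-byte `memcpy` like this to a plain register move, though that is not guaranteed, and this is only an illustration of the shape the code takes once the temporary struct has been eliminated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Builds the 64-bit value directly from the two floats. There is no
   intermediate struct, so there is no store for a later load to depend on. */
uint64_t pack_in_registers(float x, float y) {
    uint32_t lo, hi;
    memcpy(&lo, &x, sizeof lo);  /* reinterpret the bits of x */
    memcpy(&hi, &y, sizeof hi);  /* reinterpret the bits of y */
    return ((uint64_t)hi << 32) | lo;
}
```

With no store-then-load pair left in the function, store forwarding never comes into play.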