[PATCH] Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
Sanjay Patel
spatel at rotateright.com
Tue Sep 16 08:50:26 PDT 2014
>>! In D5347#10, @delena wrote:
> I just suggest to add this pattern to X86InstrSSE.td:
>
> def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
>           (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;
I tried this, but it's not producing the codegen that I want. Specifically, we want to use movddup when possible, and we don't want to alter codegen at all when not optimizing for size. (Apologies for pattern ignorance - I haven't used these yet.)
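To make the size-gating concrete: if a pattern is the way to go, I imagine it would need an OptForSize-style predicate so it can't fire in the default case, and it should select movddup rather than the ymm broadcast. A rough sketch of what I have in mind (untested - I haven't checked that the OptForSize predicate composes with these broadcast patterns, so treat this as pseudocode):

// Sketch only: splat a 64-bit constant load with movddup instead of
// vbroadcastsd, but only under optsize. Assumes the existing X86
// OptForSize predicate can be used here.
def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (VMOVDDUPrm addr:$src)>,
      Requires<[HasAVX, OptForSize]>;

That would keep the splat in an xmm register (so no vzeroupper) and use the smaller movddup encoding, but only when optimizing for size.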
1. In the testcase for v2f64, no splat is generated (movddup expected).
2. In the testcase for v2i64 with AVX, we get:
vbroadcastsd LCPI4_0(%rip), %ymm1
vpaddq %xmm1, %xmm0, %xmm0
vzeroupper <--- can the pattern be rewritten to avoid this? Even if it can, movddup is still smaller than vbroadcastsd.
This is worse in size than what my patch produces:
vmovddup LCPI4_0(%rip), %xmm1
vpaddq %xmm1, %xmm0, %xmm0
3. In the testcase for v4i64 with AVX, we would again generate vbroadcastsd:
vbroadcastsd LCPI5_0(%rip), %ymm1
vextractf128 $1, %ymm0, %xmm2
vpaddq %xmm1, %xmm2, %xmm2
vpaddq %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
But movddup is still better because it is one byte smaller than vbroadcastsd (byte-count breakdown after this list).
4. Using the pattern also caused a failure in test/CodeGen/X86/exedepsfix-broadcast.ll because a broadcast is generated even when not optimizing for size. I don't think we want to use a broadcast in that case?
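For reference, here is the byte math behind the size claim for the RIP-relative loads (my reading of the VEX encoding rules - please double-check me):

  vmovddup (mem), %xmm     : 0F map, so 2-byte VEX + opcode + modrm + disp32 = 2+1+1+4 = 8 bytes
  vbroadcastsd (mem), %ymm : 0F38 map, which forces 3-byte VEX + opcode + modrm + disp32 = 3+1+1+4 = 9 bytes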
http://reviews.llvm.org/D5347