[PATCH] Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
Demikhovsky, Elena
elena.demikhovsky at intel.com
Tue Sep 16 23:44:41 PDT 2014
Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the .td file. This pattern works perfectly:
--- lib/Target/X86/X86InstrSSE.td (revision 217862)
+++ lib/Target/X86/X86InstrSSE.td (working copy)
@@ -5279,6 +5279,11 @@
(v2i64 (scalar_to_vector (loadi64 addr:$src))))),
(VMOVDDUPrm addr:$src)>, Requires<[HasAVX]>;
+ def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+ (VMOVDDUPrm addr:$src)>;
+ def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+ (VMOVDDUPrm addr:$src)>;
+
// 256-bit version
def : Pat<(X86Movddup (loadv4f64 addr:$src)),
(VMOVDDUPYrm addr:$src)>;
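For reference, a minimal IR test that should exercise the new v2i64 pattern (a hypothetical sketch; it assumes the splat constant is lowered to an X86VBroadcast of the constant-pool load, as your patch does under optsize):

define <2 x i64> @splat_add(<2 x i64> %x) optsize {
  ; <i64 1, i64 1> becomes a constant-pool load; if the lowering emits
  ; (X86VBroadcast (loadi64 ...)), the pattern above selects VMOVDDUPrm.
  %r = add <2 x i64> %x, <i64 1, i64 1>
  ret <2 x i64> %r
}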
- Elena
-----Original Message-----
From: Sanjay Patel [mailto:spatel at rotateright.com]
Sent: Tuesday, September 16, 2014 18:50
To: spatel at rotateright.com; nrotem at apple.com; chandlerc at gmail.com; Andrea_DiBiagio at sn.scee.net; Demikhovsky, Elena
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
>>! In D5347#10, @delena wrote:
> I just suggest to add this pattern to X86InstrSSE.td:
>
> def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
>           (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;
I tried this, but it's not producing the codegen that I want. Specifically, we want to use movddup when possible, and we don't want to alter codegen at all when not optimizing for size. (Apologies for pattern ignorance - I haven't used these yet.)
1. In the testcase for v2f64, no splat is generated (movddup expected).
2. In the testcase for v2i64 with AVX, we get:
vbroadcastsd LCPI4_0(%rip), %ymm1
vpaddq %xmm1, %xmm0, %xmm0
vzeroupper <--- can the pattern be rewritten to avoid this? Even if it can, vmovddup is still smaller than vbroadcastsd.
This is worse in size than what my patch produces:
vmovddup LCPI4_0(%rip), %xmm1
vpaddq %xmm1, %xmm0, %xmm0
3. In the testcase for v4i64 with AVX, we again generate vbroadcastsd:
vbroadcastsd LCPI5_0(%rip), %ymm1
vextractf128 $1, %ymm0, %xmm2
vpaddq %xmm1, %xmm2, %xmm2
vpaddq %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
But movddup is better because it is one byte smaller than vbroadcastsd: vmovddup can be encoded with the two-byte VEX prefix, while vbroadcastsd is in the 0F38 opcode map and therefore always needs the three-byte VEX prefix.
4. Using the pattern also caused a failure in test/CodeGen/X86/exedepsfix-broadcast.ll because a broadcast is generated even when not optimizing for size. I don't think we want to use a broadcast in that case? If we do go the pattern route, the patterns could be gated on optimizing for size; see the sketch below.
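A minimal sketch of that gating (untested; it assumes an OptForSize pattern predicate is visible to these patterns):

let Predicates = [HasAVX, OptForSize] in {
  // OptForSize is assumed here: only select vmovddup for these
  // broadcast loads when the function is optimizing for size.
  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
            (VMOVDDUPrm addr:$src)>;
  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
            (VMOVDDUPrm addr:$src)>;
}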
http://reviews.llvm.org/D5347