[PATCH] Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
Demikhovsky, Elena
elena.demikhovsky at intel.com
Tue Sep 16 23:44:41 PDT 2014
Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the .td file. This pattern works perfectly:
--- lib/Target/X86/X86InstrSSE.td (revision 217862)
+++ lib/Target/X86/X86InstrSSE.td (working copy)
@@ -5279,6 +5279,11 @@
(v2i64 (scalar_to_vector (loadi64 addr:$src))))),
(VMOVDDUPrm addr:$src)>, Requires<[HasAVX]>;
+ def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+ (VMOVDDUPrm addr:$src)>;
+ def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+ (VMOVDDUPrm addr:$src)>;
+
// 256-bit version
def : Pat<(X86Movddup (loadv4f64 addr:$src)),
(VMOVDDUPYrm addr:$src)>;
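For reference, a minimal IR test that should exercise the new v2i64 pattern (a hypothetical sketch; it assumes the splat constant is lowered to an X86VBroadcast of the constant-pool load, as your patch does under optsize):

define <2 x i64> @splat_add(<2 x i64> %x) optsize {
  ; <i64 1, i64 1> becomes a constant-pool load; if the lowering emits
  ; (X86VBroadcast (loadi64 ...)), the pattern above selects VMOVDDUPrm.
  %r = add <2 x i64> %x, <i64 1, i64 1>
  ret <2 x i64> %r
}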
- Elena
-----Original Message-----
From: Sanjay Patel [mailto:spatel at rotateright.com]
Sent: Tuesday, September 16, 2014 18:50
To: spatel at rotateright.com; nrotem at apple.com; chandlerc at gmail.com; Andrea_DiBiagio at sn.scee.net; Demikhovsky, Elena
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
>>! In D5347#10, @delena wrote:
> I just suggest to add this pattern to X86InstrSSE.td:
>
> def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
>           (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;
I tried this, but it's not producing the codegen that I want. Specifically, we want to use movddup when possible, and we don't want to alter codegen at all when not optimizing for size. (Apologies for pattern ignorance - I haven't used these yet.)
1. In the testcase for v2f64, no splat is generated (movddup expected).
2. In the testcase for v2i64 with AVX, we get:
vbroadcastsd LCPI4_0(%rip), %ymm1
vpaddq %xmm1, %xmm0, %xmm0
vzeroupper <--- can the pattern be rewritten to avoid this? Even if it can, vmovddup is still smaller than vbroadcastsd.
This is worse in size than what my patch produces:
vmovddup LCPI4_0(%rip), %xmm1
vpaddq %xmm1, %xmm0, %xmm0
3. In the testcase for v4i64 with AVX, we again generate vbroadcastsd:
vbroadcastsd LCPI5_0(%rip), %ymm1
vextractf128 $1, %ymm0, %xmm2
vpaddq %xmm1, %xmm2, %xmm2
vpaddq %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
But movddup is better because it is one byte smaller than vbroadcastsd: vmovddup can be encoded with the two-byte VEX prefix, while vbroadcastsd is in the 0F38 opcode map and therefore always needs the three-byte VEX prefix.
4. Using the pattern also caused a failure in test/CodeGen/X86/exedepsfix-broadcast.ll because a broadcast is generated even when not optimizing for size. I don't think we want to use a broadcast in that case? If we do go the pattern route, the patterns could be gated on optimizing for size; see the sketch below.
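A minimal sketch of that gating (untested; it assumes an OptForSize pattern predicate is visible to these patterns):

let Predicates = [HasAVX, OptForSize] in {
  // OptForSize is assumed here: only select vmovddup for these
  // broadcast loads when the function is optimizing for size.
  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
            (VMOVDDUPrm addr:$src)>;
  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
            (VMOVDDUPrm addr:$src)>;
}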
http://reviews.llvm.org/D5347