[PATCH] D18676: [x86] avoid intermediate splat for non-zero memsets (PR27100)

Sanjay Patel via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 1 07:25:20 PDT 2016


spatel added inline comments.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:2035
@@ +2034,3 @@
+      if (Size >= 32 && Subtarget.hasAVX()) {
+        // Although this isn't a legal type for AVX1, we'll let legalization
+        // and shuffle lowering produce the optimal codegen. If we choose
----------------
RKSimon wrote:
> This /is/ a legal type for AVX1 - the trouble is we can't do much with it.
Thanks - I'll fix that comment before committing.

================
Comment at: test/CodeGen/X86/memset-nonzero.ll:94
@@ +93,3 @@
+; AVX-LABEL: memset_128_nonzero_bytes:
+; AVX:         vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42]
+; AVX-NEXT:    vmovups %ymm0, 96(%rdi)
----------------
RKSimon wrote:
> andreadb wrote:
> > I noticed that on AVX we now always generate a vmovaps to load a vector of constants.
> > That's obviously fine. However, I wonder if a vbroadcastss would be more appropriate in this case as it would use a smaller constant (for code size only - in this example we would save 28 bytes).
> This is what is being discussed in PR27141 - it's proving tricky to determine when the broadcast is worth it and when it will cause register pressure issues.
The difference is actually on the AVX2 side; AVX1 was already loading a vector, just in a different format (v4f32).

I know we've discussed the splat load vs. vector load trade-off before. Let's follow up in PR27141, or I'll open another report so we can look at the problem Simon mentioned directly.


http://reviews.llvm.org/D18676
