[PATCH] D14151: [X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension

Simon Pilgrim via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 18 10:29:33 PST 2015


RKSimon added a comment.

I'll update this patch to be just the fix, containing only the removed lines (and the altered tests), and then prepare a second patch that improves the code quality. We are missing a potentially big perf gain: a VEX-encoded instruction that writes an XMM register implicitly zeroes the upper half of the corresponding YMM register, which matters especially on 128-bit ALUs such as Jaguar and Sandy Bridge.
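For reference, a minimal intrinsics sketch of the operation in question (illustrative only; whether a given compiler folds it to a single move is exactly the code-quality issue the follow-up patch will target):

  #include <immintrin.h>

  // Zero-extend a 128-bit vector to 256 bits. Ideally this is a single
  // 128-bit VMOVAPS: any VEX-encoded instruction that writes an XMM
  // register implicitly zeroes bits 255:128 of the corresponding YMM
  // register, so no explicit VXORPS + VINSERTF128 pair is needed.
  __m256 zext_ps128(__m128 lo) {
    return _mm256_insertf128_ps(_mm256_setzero_ps(), lo, 0);
  }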


================
Comment at: lib/Target/X86/X86InstrSSE.td:7207-7227
@@ -7221,2 +7206,23 @@
 
+  def : Pat<(v8i32 (X86vzmovl (insert_subvector undef,
+                   (v4i32 VR128:$src), (iPTR 0)))),
+            (SUBREG_TO_REG (i32 0),
+                           (VPBLENDWrri (v4i32 (V_SET0)), VR128:$src, (i8 3)),
+                           sub_xmm)>;
+  def : Pat<(v4i64 (X86vzmovl (insert_subvector undef,
+                   (v2i64 VR128:$src), (iPTR 0)))),
+            (SUBREG_TO_REG (i32 0),
+                           (VPBLENDWrri (v4i32 (V_SET0)), VR128:$src, (i8 15)),
+                           sub_xmm)>;
+  def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
+                   (v4f32 VR128:$src), (iPTR 0)))),
+            (SUBREG_TO_REG (i32 0),
+                           (VBLENDPSrri (v4f32 (V_SET0)), VR128:$src, (i8 1)),
+                           sub_xmm)>;
+  def : Pat<(v4f64 (X86vzmovl (insert_subvector undef,
+                   (v2f64 VR128:$src), (iPTR 0)))),
+            (SUBREG_TO_REG (i32 0),
+                           (VBLENDPDrri (v2f64 (V_SET0)), VR128:$src, (i8 1)),
+                           sub_xmm)>;
+
   // These will incur an FP/int domain crossing penalty, but it may be the only
----------------
andreadb wrote:
> I don't think these new patterns are needed. We already have sse4.1/avx patterns to select a blend from a vzmovl node.
> 
> If your goal is just to fix the miscompile, then the minimal fix consists of removing the offending patterns between lines 939 and 952.
> 
> The poor codegen reported by Jeroen is caused by the lack of smarter x86 combine rules for 256-bit shuffles in function 'PerformShuffleCombine256'. That function currently implements only a very simple rule for a shuffle of two concat_vectors nodes. Ideally we should extend it with rules for the case where the second operand is a build_vector of all zeroes.
> 
> Currently we check if a shuffle takes two concat_vectors as inputs and we try to fold it to a zero-extending load or an insert of a 128-bit vector into a zero vector.
> I think that we are just missing rules for the case where we are inserting a 64/32-bit quantity into a zero vector.
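The kind of rule described above might look roughly like this (a sketch only; the helper name, signature and exact checks are illustrative, not part of either patch):

  // Fold: shuffle(concat_vectors(X, undef), all_zeroes, mask)
  //   -> X86ISD::VZEXT_MOVL(concat_vectors(X, undef))
  // when the mask keeps element 0 of X and takes zeroes (or undef)
  // everywhere else.
  static SDValue combineShuffleWithZeros(ShuffleVectorSDNode *SVN,
                                         SelectionDAG &DAG) {
    EVT VT = SVN->getValueType(0);
    SDValue V1 = SVN->getOperand(0);
    SDValue V2 = SVN->getOperand(1);
    int NumElts = (int)VT.getVectorNumElements();

    if (V1.getOpcode() != ISD::CONCAT_VECTORS ||
        !ISD::isBuildVectorAllZeros(V2.getNode()))
      return SDValue();

    // Element 0 must be element 0 of V1; all other elements must be
    // undef or refer to the zero vector (indices NumElts and above).
    if (SVN->getMaskElt(0) != 0)
      return SDValue();
    for (int i = 1; i != NumElts; ++i) {
      int M = SVN->getMaskElt(i);
      if (M >= 0 && M < NumElts)
        return SDValue();
    }
    return DAG.getNode(X86ISD::VZEXT_MOVL, SDLoc(SVN), VT, V1);
  }

V1 already carries the original 128-bit value in its low half, so emitting VZEXT_MOVL on it is enough; instruction selection can then pick the existing vzmovl blend patterns.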
I can confirm that just removing lines 939 to 952 fixes the problem. It does, however, leave AVX1 targets with a lot of integer/float domain-crossing stalls when dealing with 256-bit vectors, as illustrated below.
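To make the stall concrete, here is roughly the integer case through intrinsics (a hypothetical helper, not actual compiler output): the data has to detour through the float domain around the VBLENDPS, since AVX1 has no 256-bit integer blend.

  #include <immintrin.h>

  // Zero all but the low 32-bit element of a 256-bit integer vector.
  // The casts are free, but the value crosses the int -> float -> int
  // domains around the VBLENDPS, paying a bypass delay each way.
  __m256i zext_epi32(__m128i lo) {
    __m256 f = _mm256_castsi256_ps(_mm256_castsi128_si256(lo));
    __m256 z = _mm256_blend_ps(_mm256_setzero_ps(), f, 1); // keep element 0
    return _mm256_castps_si256(z);
  }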


Repository:
  rL LLVM

http://reviews.llvm.org/D14151