[PATCH] optimize merging of scalar loads for 32-byte vectors [X86, AVX] (PR21710)

Thu Dec 4 13:04:27 PST 2014

Hi qcolombet, andreadb, RKSimon,

This patch fixes the poor codegen seen in PR21710 ( http://llvm.org/bugs/show_bug.cgi?id=21710 ). Before we split 32-byte build vectors into smaller chunks (and then glue them back together), we should check for the easy case where all of the elements can be loaded with a single op.
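
To illustrate the "easy case", here is a minimal standalone C++ sketch of the kind of check involved: are all of the element loads the same size, non-volatile, off the same base pointer, and at consecutive offsets? This is only a model, not the code in X86ISelLowering.cpp; the names ScalarLoad and findMergedLoadOffset and the integer baseId are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of one scalar load that feeds a build vector element.
struct ScalarLoad {
    int baseId;          // stands in for the base pointer (e.g. %rdi)
    std::int64_t offset; // byte offset from that base
    std::size_t size;    // element size in bytes
    bool isVolatile;     // volatile loads must not be merged
};

// Returns the byte offset of a single wide load that could replace all of
// the element loads, or std::nullopt if the elements are not consecutive.
std::optional<std::int64_t>
findMergedLoadOffset(const std::vector<ScalarLoad>& elts) {
    if (elts.empty())
        return std::nullopt;

    const ScalarLoad& first = elts.front();
    if (first.isVolatile)
        return std::nullopt;

    for (std::size_t i = 1; i < elts.size(); ++i) {
        const ScalarLoad& e = elts[i];
        // Every element must use the same base and size, and be non-volatile.
        if (e.isVolatile || e.baseId != first.baseId || e.size != first.size)
            return std::nullopt;
        // Element i must sit exactly i elements past the first one.
        const std::int64_t expected =
            first.offset + static_cast<std::int64_t>(i * first.size);
        if (e.offset != expected)
            return std::nullopt;
    }
    return first.offset;
}
```

With eight 4-byte elements at offsets 0, 4, ..., 28 from the same base, the sketch reports offset 0, which corresponds to the single 32-byte load in the new codegen. Any gap or reordering makes it return nothing, so the existing splitting path would still apply.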

The codegen change for the last 2 testcases (derived from the bug report examples) is from:
   vmovss	16(%rdi), %xmm1
   vmovups	(%rdi), %xmm0
   vinsertps	$16, 20(%rdi), %xmm1, %xmm1
   vinsertps	$32, 24(%rdi), %xmm1, %xmm1
   vinsertps	$48, 28(%rdi), %xmm1, %xmm1
   vinsertf128	$1, %xmm1, %ymm0, %ymm0
   retq

To:
   vmovups	(%rdi), %ymm0
   retq

And:
   vmovsd	16(%rdi), %xmm1
   vmovupd	(%rdi), %xmm0
   vmovhpd	24(%rdi), %xmm1, %xmm1
   vinsertf128	$1, %xmm1, %ymm0, %ymm0
   retq

To:
   vmovups	(%rdi), %ymm0
   retq

I think it's benign that we generate 'vmovups' rather than 'vmovupd' in that 2nd case because the result isn't used in this function. I confirmed that we do select the double instruction if the load result is actually used in the function.
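
As a rough illustration of why the choice of mnemonic doesn't change the loaded value: both instructions read the same 32 bytes, and as far as I know they differ only in the float/double domain hint. The standalone C++ sketch below models that with a type-agnostic 32-byte copy. The helper names load32 and asDoubles are hypothetical, and this says nothing about domain-crossing costs on real hardware.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Copies 32 bytes with no knowledge of the element type, much like an
// unaligned 256-bit load reads raw bytes regardless of its mnemonic.
std::array<unsigned char, 32> load32(const void* p) {
    std::array<unsigned char, 32> bytes{};
    std::memcpy(bytes.data(), p, bytes.size());
    return bytes;
}

// Reinterprets those same 32 bytes as four doubles.
std::array<double, 4> asDoubles(const std::array<unsigned char, 32>& bytes) {
    static_assert(sizeof(double) == 8, "sketch assumes 8-byte doubles");
    std::array<double, 4> out{};
    std::memcpy(out.data(), bytes.data(), sizeof(double) * out.size());
    return out;
}
```

Feeding four doubles through load32 and then asDoubles returns the original values unchanged, which is the sense in which the untyped load is harmless here.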

I've also updated the existing load merge test to use FileCheck and added a v4f32 test for completeness.

http://reviews.llvm.org/D6536

Files:
  lib/Target/X86/X86ISelLowering.cpp
  test/CodeGen/X86/vec_loadsingles.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D6536.16945.patch
Type: text/x-patch
Size: 4874 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141204/b5b9e161/attachment.bin>