[PATCH] D33938: [x86] use vperm2f128 rather than vinsertf128 when there's a chance to fold a 32-byte load

Tue Jun 6 06:43:30 PDT 2017

spatel created this revision.
Herald added a subscriber: mcrosier.

I was looking more closely at the x86 test diffs in https://reviews.llvm.org/D33866, and the first change there seems like it shouldn't happen at all. This patch tries to resolve that.

According to Agner's tables and the AMD docs, vperm2f128 and vinsertf128 have identical timing on any given CPU model, so we should be able to interchange them without affecting perf. But as some of the diffs here show, vperm2f128 can fold a full 32-byte load as its memory operand (vinsertf128 can only fold a 16-byte one), so we should take that opportunity to reduce code size and register pressure.
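
To make the interchangeability concrete, here is a minimal scalar sketch of the two instructions' lane semantics as I understand them from the ISA documentation. It is not code from the patch: the names emulate_vinsertf128 and emulate_vperm2f128 and the Lane/Vec256 types are hypothetical, made up for illustration. The second source `b` plays the role of the operand that vperm2f128 can take directly from memory as 32 bytes.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model types: a 256-bit vector as two 128-bit lanes of floats.
using Lane = std::array<float, 4>;
using Vec256 = std::array<Lane, 2>;  // [0] = low lane, [1] = high lane

// Model of vinsertf128: copy src, then overwrite the lane chosen by imm bit 0
// with a 128-bit value (the only part that could come from memory).
Vec256 emulate_vinsertf128(Vec256 src, const Lane &ins, uint8_t imm) {
  src[imm & 1] = ins;
  return src;
}

// Model of vperm2f128: each result lane is picked from {a.lo, a.hi, b.lo, b.hi}
// by a 2-bit selector; imm bit 3 / bit 7 zero the low / high result lane.
Vec256 emulate_vperm2f128(const Vec256 &a, const Vec256 &b, uint8_t imm) {
  auto pick = [&](uint8_t ctl) -> Lane {
    if (ctl & 0x8)
      return Lane{};  // zeroing bit set
    switch (ctl & 0x3) {
      case 0: return a[0];
      case 1: return a[1];
      case 2: return b[0];
      default: return b[1];
    }
  };
  Vec256 result;
  result[0] = pick(imm & 0xF);  // low nibble controls the low lane
  result[1] = pick(imm >> 4);   // high nibble controls the high lane
  return result;
}
```

Under this model, inserting b's low lane into the high half of a (vinsertf128 with imm 1) gives the same result as vperm2f128 with imm 0x20, and inserting b's high lane there matches imm 0x30. In the vperm2f128 form, all of b is a single operand, which is what leaves room for a 32-byte load to be folded.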

A secondary advantage is making AVX1 and AVX2 codegen more similar. Given that vperm2f128 was introduced with AVX1, we should be using it in all of the same situations that we would with AVX2. If there's some reason that an AVX1 CPU would not want to use this instruction, I think that should be fixed up in a later pass.

https://reviews.llvm.org/D33938

Files:
  lib/Target/X86/X86ISelLowering.cpp
  test/CodeGen/X86/avx-vperm2x128.ll
  test/CodeGen/X86/x86-interleaved-access.ll
