[PATCH] D24681: Optimize patterns of vectorized interleaved memory accesses for X86.

Fri Sep 16 13:01:00 PDT 2016

Farhana created this revision.
Farhana added reviewers: DavidKreitzer, mkuper, delena.
Farhana added a subscriber: llvm-commits.
Herald added subscribers: mgorny, beanz, aemerson.

[X86InterleavedAccess] Optimize patterns of vectorized interleaved memory accesses for X86.

Prior to this change, there was no X86 implementation of InterleavedAccessPass, which detects a set of interleaved accesses and generates target-specific intrinsics.
Here is an example of interleaved loads:

//        %wide.vec = load <8 x i32>, <8 x i32>* %ptr
//        %v0 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <0, 2, 4, 6>
//        %v1 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <1, 3, 5, 7>

The ARM implementation generates ldn/stn intrinsics.

This change-set puts in place the basic framework to support InterleavedAccessPass on X86. It also tries to detect one interleaved pattern (4 interleaved accesses, stride 4, 64-bit elements, on AVX/AVX2) and generates an optimized sequence for it.
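
To make the detected pattern concrete, here is a minimal standalone sketch of the mask check for a stride-4 group. It is not code from the patch: isStridedMask and isFactor4Group are hypothetical helpers, and shuffle masks are modeled as plain integer vectors rather than LLVM types.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): returns true if Mask selects
// elements Index, Index + Factor, Index + 2 * Factor, ... of the wide vector.
bool isStridedMask(const std::vector<int> &Mask, int Factor, int Index) {
  if (Mask.empty() || Factor <= 0 || Index < 0 || Index >= Factor)
    return false;
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != Index + static_cast<int>(I) * Factor)
      return false;
  return true;
}

// Hypothetical helper: returns true if the four masks form the complete
// stride-4 de-interleave of a 16-element wide vector, i.e. mask I starts at
// element I and has four lanes.
bool isFactor4Group(const std::vector<std::vector<int>> &Masks) {
  if (Masks.size() != 4)
    return false;
  for (int I = 0; I < 4; ++I)
    if (Masks[I].size() != 4 || !isStridedMask(Masks[I], 4, I))
      return false;
  return true;
}
```

Under these assumptions, the masks <0, 4, 8, 12>, <1, 5, 9, 13>, <2, 6, 10, 14>, <3, 7, 11, 15> are accepted as a group, while a stride-2 mask such as <0, 2, 4, 6> is rejected for factor 4.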

This is just the first step of a long effort. The short-term plan is to continue supporting a few patterns this way while we work out a more general solution.

In order to allow code sharing between multiple transpose functions, the next change-set will introduce a class that will encapsulate all the necessary information.

With this change-set, the following transformation is performed:
/// Currently supported interleaved loads (here, T = {i/f}):
///   %wide.vec = load <16 x T64>, <16 x T64>* %ptr
///   %v0 = shuffle %wide.vec, undef, <0, 4, 8, 12>  ;
///   %v1 = shuffle %wide.vec, undef, <1, 5, 9, 13>  ;
///   %v2 = shuffle %wide.vec, undef, <2, 6, 10, 14> ;
///   %v3 = shuffle %wide.vec, undef, <3, 7, 11, 15> ;
///
///   Into:
///   %load0 = load <4 x T64>, <4 x T64>* %ptr
///   %load1 = load <4 x T64>, <4 x T64>* %ptr+32
///   %load2 = load <4 x T64>, <4 x T64>* %ptr+64
///   %load3 = load <4 x T64>, <4 x T64>* %ptr+96
///
///   %intrshuffvec1 = shuffle %load0, %load2, <0, 1, 4, 5>;
///   %intrshuffvec2 = shuffle %load1, %load3, <0, 1, 4, 5>;
///   %v0 = shuffle %intrshuffvec1, %intrshuffvec2, <0, 4, 2, 6>;
///   %v1 = shuffle %intrshuffvec1, %intrshuffvec2, <1, 5, 3, 7>;
///
///   %intrshuffvec3 = shuffle %load0, %load2, <2, 3, 6, 7>;
///   %intrshuffvec4 = shuffle %load1, %load3, <2, 3, 6, 7>;
///   %v2 = shuffle %intrshuffvec3, %intrshuffvec4, <0, 4, 2, 6>;
///   %v3 = shuffle %intrshuffvec3, %intrshuffvec4, <1, 5, 3, 7>;
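
The shuffle sequence above can be checked with a small scalar model. The sketch below is illustrative only and is not part of the patch: the names shuffle and deinterleave4 are hypothetical, elements are modeled as int, and each two-input shuffle follows the usual convention that mask indices 0-3 pick from the first operand and 4-7 from the second.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<int, 4>;

// Models a two-input, 4-lane shuffle: mask indices 0-3 select from A,
// indices 4-7 select from B.
Vec4 shuffle(const Vec4 &A, const Vec4 &B, const std::array<int, 4> &Mask) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = Mask[I] < 4 ? A[Mask[I]] : B[Mask[I] - 4];
  return R;
}

// Applies the four narrow loads and eight shuffles from the description to a
// 16-element buffer and returns {v0, v1, v2, v3}.
std::array<Vec4, 4> deinterleave4(const std::array<int, 16> &Wide) {
  // Four consecutive 4-element loads (%load0 .. %load3).
  Vec4 Load[4];
  for (int L = 0; L < 4; ++L)
    for (int I = 0; I < 4; ++I)
      Load[L][I] = Wide[4 * L + I];

  // Low halves of each pair of loads, then the final v0/v1.
  Vec4 S1 = shuffle(Load[0], Load[2], {0, 1, 4, 5});
  Vec4 S2 = shuffle(Load[1], Load[3], {0, 1, 4, 5});
  Vec4 V0 = shuffle(S1, S2, {0, 4, 2, 6});
  Vec4 V1 = shuffle(S1, S2, {1, 5, 3, 7});

  // High halves of each pair of loads, then the final v2/v3.
  Vec4 S3 = shuffle(Load[0], Load[2], {2, 3, 6, 7});
  Vec4 S4 = shuffle(Load[1], Load[3], {2, 3, 6, 7});
  Vec4 V2 = shuffle(S3, S4, {0, 4, 2, 6});
  Vec4 V3 = shuffle(S3, S4, {1, 5, 3, 7});

  return {{V0, V1, V2, V3}};
}
```

Feeding in the buffer 0, 1, ..., 15 gives v0 = {0, 4, 8, 12}, v1 = {1, 5, 9, 13}, v2 = {2, 6, 10, 14} and v3 = {3, 7, 11, 15}, which matches the masks of the original wide shuffles.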

https://reviews.llvm.org/D24681

Files:
  lib/CodeGen/InterleavedAccessPass.cpp
  lib/Target/X86/CMakeLists.txt
  lib/Target/X86/X86ISelLowering.h
  lib/Target/X86/X86InterleavedAccess.cpp
  lib/Target/X86/X86TargetMachine.cpp
  test/CodeGen/X86/x86-interleaved-access.ll
