<div dir="ltr">This patch has broken buildbots: <a href="http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/6694/steps/test/logs/stdio">http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/6694/steps/test/logs/stdio</a><div><br></div><div>Please fix or revert.</div><div><br><div>...</div><div><div>FAIL: LLVM :: Transforms/SLPVectorizer/AArch64/gather-root.ll (32439 of 33934)</div><div>******************** TEST 'LLVM :: Transforms/SLPVectorizer/AArch64/gather-root.ll' FAILED ********************</div><div>Script:</div><div>--</div><div>/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt < /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll -slp-vectorizer -S | /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/FileCheck /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll --check-prefix=DEFAULT</div><div>/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt < /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-30 -slp-vectorizer -S | /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/FileCheck /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll --check-prefix=GATHER</div><div>/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt < /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll -slp-schedule-budget=0 -slp-threshold=-30 -slp-vectorizer -S | /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/FileCheck /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll --check-prefix=MAX-COST</div><div>--</div><div>Exit 
Code: 2</div><div><br></div><div>Command Output (stderr):</div><div>--</div><div>opt: /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/lib/Transforms/Vectorize/SLPVectorizer.cpp:3292: llvm::Value *llvm::slpvectorizer::BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &): Assertion `!E->NeedToGather && "Extracting from a gather list"' failed.</div><div>#0 0x0000000001c49c34 PrintStackTraceSignalHandler(void*) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1c49c34)</div><div>#1 0x0000000001c49f76 SignalHandler(int) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1c49f76)</div><div>#2 0x00007fc461e8e0c0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x110c0)</div><div>#3 0x00007fc460a28fff gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x32fff)</div><div>#4 0x00007fc460a2a42a abort (/lib/x86_64-linux-gnu/libc.so.6+0x3442a)</div><div>#5 0x00007fc460a21e67 (/lib/x86_64-linux-gnu/libc.so.6+0x2be67)</div><div>#6 0x00007fc460a21f12 (/lib/x86_64-linux-gnu/libc.so.6+0x2bf12)</div><div>#7 0x0000000001d7c5fd llvm::slpvectorizer::BoUpSLP::vectorizeTree(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u> >, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u> > > > >&) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1d7c5fd)</div><div>#8 0x0000000001d87aab llvm::SLPVectorizerPass::vectorizeRootInstruction(llvm::PHINode*, llvm::Value*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1d87aab)</div><div>#9 0x0000000001d831e6 llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) 
(/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1d831e6)</div><div>#10 0x0000000001d81c10 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1d81c10)</div><div>#11 0x0000000001d8e1d6 (anonymous namespace)::SLPVectorizer::runOnFunction(llvm::Function&) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1d8e1d6)</div><div>#12 0x000000000177646f llvm::FPPassManager::runOnFunction(llvm::Function&) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x177646f)</div><div>#13 0x00000000017766c3 llvm::FPPassManager::runOnModule(llvm::Module&) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x17766c3)</div><div>#14 0x0000000001776bc6 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x1776bc6)</div><div>#15 0x00000000006f6e0f main (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x6f6e0f)</div><div>#16 0x00007fc460a162e1 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e1)</div><div>#17 0x00000000006e803a _start (/home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt+0x6e803a)</div><div>Stack dump:</div><div>0.<span style="white-space:pre">   </span>Program arguments: /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/opt -slp-schedule-budget=0 -slp-min-tree-size=0 -slp-threshold=-30 -slp-vectorizer -S </div><div>1.<span style="white-space:pre">      </span>Running pass 'Function Pass Manager' on module '<stdin>'.</div><div>2.<span style="white-space:pre">     </span>Running pass 'SLP Vectorizer' on function '@PR28330'</div><div>FileCheck error: '-' is 
empty.</div><div>FileCheck command line:  /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.obj/./bin/FileCheck /home/llvmbb/llvm-build-dir/clang-x86_64-debian-fast/llvm.src/test/Transforms/SLPVectorizer/AArch64/gather-root.ll --check-prefix=GATHER</div><div><br></div><div>--</div><div><br></div><div>********************</div></div><div>...</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 20, 2017 at 10:18 AM, Mohammad Shahid via llvm-commits <span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Author: ashahid<br>
Date: Wed Sep 20 01:18:28 2017<br>
New Revision: 313736<br>
<br>
URL: <a href="http://llvm.org/viewvc/llvm-project?rev=313736&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-<wbr>project?rev=313736&view=rev</a><br>
Log:<br>
[SLP] Vectorize jumbled memory loads.<br>
<br>
Summary:<br>
This patch tries to vectorize loads of consecutive memory accesses that are<br>
accessed in a non-consecutive, or jumbled, order. An earlier attempt was made<br>
with patch D26905, which was reverted due to a basic issue with representing<br>
the 'use mask' of jumbled accesses.<br>
<br>
This patch fixes the mask representation by recording the 'use mask' in the user tree entry.<br>
<br>
Change-Id: I9fe7f5045f065d84c126fa307ef6e<wbr>be0787296df<br>
<br>
Reviewers: mkuper, loladiro, Ayal, zvi, danielcdh<br>
<br>
Reviewed By: Ayal<br>
<br>
Subscribers: mzolotukhin<br>
<br>
Differential Revision: <a href="https://reviews.llvm.org/D36130" rel="noreferrer" target="_blank">https://reviews.llvm.org/<wbr>D36130</a><br>
<br>
Commit after rebase for patch D36130<br>
<br>
Change-Id: I8add1c265455669ef288d880f870a<wbr>9522c8c08ab<br>
<br>
Added:<br>
    llvm/trunk/test/Transforms/<wbr>SLPVectorizer/X86/jumbled-<wbr>load-shuffle-placement.ll<br>
Modified:<br>
    llvm/trunk/include/llvm/<wbr>Analysis/LoopAccessAnalysis.h<br>
    llvm/trunk/lib/Analysis/<wbr>LoopAccessAnalysis.cpp<br>
    llvm/trunk/lib/Transforms/<wbr>Vectorize/SLPVectorizer.cpp<br>
    llvm/trunk/test/Transforms/<wbr>SLPVectorizer/X86/jumbled-<wbr>load-multiuse.ll<br>
    llvm/trunk/test/Transforms/<wbr>SLPVectorizer/X86/jumbled-<wbr>load.ll<br>
    llvm/trunk/test/Transforms/<wbr>SLPVectorizer/X86/store-<wbr>jumbled.ll<br>
<br>
Modified: llvm/trunk/include/llvm/<wbr>Analysis/LoopAccessAnalysis.h<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-<wbr>project/llvm/trunk/include/<wbr>llvm/Analysis/<wbr>LoopAccessAnalysis.h?rev=<wbr>313736&r1=313735&r2=313736&<wbr>view=diff</a><br>
==============================<wbr>==============================<wbr>==================<br>
--- llvm/trunk/include/llvm/<wbr>Analysis/LoopAccessAnalysis.h (original)<br>
+++ llvm/trunk/include/llvm/<wbr>Analysis/LoopAccessAnalysis.h Wed Sep 20 01:18:28 2017<br>
@@ -667,6 +667,21 @@ int64_t getPtrStride(<wbr>PredicatedScalarEvo<br>
                      const ValueToValueMap &StridesMap = ValueToValueMap(),<br>
                      bool Assume = false, bool ShouldCheckWrap = true);<br>
<br>
+/// \brief Attempt to sort the 'loads' in \p VL and return the sorted values in<br>
+/// \p Sorted.<br>
+///<br>
+/// Returns 'false' if sorting is not legal or feasible; otherwise returns<br>
+/// 'true'. If \p Mask is not null, it is also filled with the shuffle mask<br>
+/// for the actual memory access order.<br>
+///<br>
+/// For example, for a given VL of memory accesses in program order, a[i+2],<br>
+/// a[i+0], a[i+1] and a[i+3], this function sorts the VL, saves the sorted<br>
+/// values in 'Sorted' as a[i+0], a[i+1], a[i+2], a[i+3], and saves the mask<br>
+/// for the actual memory access order in 'Mask' as <2,0,1,3>.<br>
+bool sortLoadAccesses(ArrayRef<<wbr>Value *> VL, const DataLayout &DL,<br>
+    ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted,<br>
+    SmallVectorImpl<unsigned> *Mask = nullptr);<br>
+<br>
 /// \brief Returns true if the memory operations \p A and \p B are consecutive.<br>
 /// This is a simple API that does not depend on the analysis pass.<br>
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,<br>
<br>
Modified: llvm/trunk/lib/Analysis/<wbr>LoopAccessAnalysis.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-<wbr>project/llvm/trunk/lib/<wbr>Analysis/LoopAccessAnalysis.<wbr>cpp?rev=313736&r1=313735&r2=<wbr>313736&view=diff</a><br>
==============================<wbr>==============================<wbr>==================<br>
--- llvm/trunk/lib/Analysis/<wbr>LoopAccessAnalysis.cpp (original)<br>
+++ llvm/trunk/lib/Analysis/<wbr>LoopAccessAnalysis.cpp Wed Sep 20 01:18:28 2017<br>
@@ -1107,6 +1107,76 @@ static unsigned getAddressSpaceOperand(V<br>
   return -1;<br>
 }<br>
<br>
+// TODO: This API can be improved by using the permutation of a given width as the<br>
+// accesses are entered into the map.<br>
+bool llvm::sortLoadAccesses(<wbr>ArrayRef<Value *> VL, const DataLayout &DL,<br>
+                           ScalarEvolution &SE,<br>
+                           SmallVectorImpl<Value *> &Sorted,<br>
+                           SmallVectorImpl<unsigned> *Mask) {<br>
+  SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;<br>
+  OffValPairs.reserve(VL.size())<wbr>;<br>
+  Sorted.reserve(VL.size());<br>
+<br>
+  // Walk over the pointers, and map each of them to an offset relative to<br>
+  // first pointer in the array.<br>
+  Value *Ptr0 = getPointerOperand(VL[0]);<br>
+  const SCEV *Scev0 = SE.getSCEV(Ptr0);<br>
+  Value *Obj0 = GetUnderlyingObject(Ptr0, DL);<br>
+  PointerType *PtrTy = dyn_cast<PointerType>(Ptr0-><wbr>getType());<br>
+  uint64_t Size = DL.getTypeAllocSize(PtrTy-><wbr>getElementType());<br>
+<br>
+  for (auto *Val : VL) {<br>
+    // The only kind of access we care about here is load.<br>
+    if (!isa<LoadInst>(Val))<br>
+      return false;<br>
+<br>
+    Value *Ptr = getPointerOperand(Val);<br>
+    assert(Ptr && "Expected value to have a pointer operand.");<br>
+    // If a pointer refers to a different underlying object, bail - the<br>
+    // pointers are by definition incomparable.<br>
+    Value *CurrObj = GetUnderlyingObject(Ptr, DL);<br>
+    if (CurrObj != Obj0)<br>
+      return false;<br>
+<br>
+    const SCEVConstant *Diff =<br>
+        dyn_cast<SCEVConstant>(SE.<wbr>getMinusSCEV(SE.getSCEV(Ptr), Scev0));<br>
+    // The pointers may not have a constant offset from each other, or SCEV<br>
+    // may just not be smart enough to figure out they do. Regardless,<br>
+    // there's nothing we can do.<br>
+    if (!Diff || Diff->getAPInt().abs().<wbr>getSExtValue() > (VL.size() - 1) * Size)<br>
+      return false;<br>
+<br>
+    OffValPairs.emplace_back(Diff-<wbr>>getAPInt().getSExtValue(), Val);<br>
+  }<br>
+  SmallVector<unsigned, 4> UseOrder(VL.size());<br>
+  for (unsigned i = 0; i < VL.size(); i++) {<br>
+    UseOrder[i] = i;<br>
+  }<br>
+<br>
+  // Sort the memory accesses and keep the order of their uses in UseOrder.<br>
+  std::sort(UseOrder.begin(), UseOrder.end(),<br>
+            [&OffValPairs](unsigned Left, unsigned Right) {<br>
+            return OffValPairs[Left].first < OffValPairs[Right].first;<br>
+            });<br>
+<br>
+  for (unsigned i = 0; i < VL.size(); i++)<br>
+    Sorted.emplace_back(<wbr>OffValPairs[UseOrder[i]].<wbr>second);<br>
+<br>
+  // Sort UseOrder to compute the Mask.<br>
+  if (Mask) {<br>
+    Mask->reserve(VL.size());<br>
+    for (unsigned i = 0; i < VL.size(); i++)<br>
+      Mask->emplace_back(i);<br>
+    std::sort(Mask->begin(), Mask->end(),<br>
+              [&UseOrder](unsigned Left, unsigned Right) {<br>
+              return UseOrder[Left] < UseOrder[Right];<br>
+              });<br>
+  }<br>
+<br>
+  return true;<br>
+}<br>
+<br>
+<br>
 /// Returns true if the memory operations \p A and \p B are consecutive.<br>
 bool llvm::isConsecutiveAccess(<wbr>Value *A, Value *B, const DataLayout &DL,<br>
                                ScalarEvolution &SE, bool CheckType) {<br>
<br>
Modified: llvm/trunk/lib/Transforms/<wbr>Vectorize/SLPVectorizer.cpp<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-<wbr>project/llvm/trunk/lib/<wbr>Transforms/Vectorize/<wbr>SLPVectorizer.cpp?rev=313736&<wbr>r1=313735&r2=313736&view=diff</a><br>
==============================<wbr>==============================<wbr>==================<br>
--- llvm/trunk/lib/Transforms/<wbr>Vectorize/SLPVectorizer.cpp (original)<br>
+++ llvm/trunk/lib/Transforms/<wbr>Vectorize/SLPVectorizer.cpp Wed Sep 20 01:18:28 2017<br>
@@ -637,17 +637,23 @@ private:<br>
   int getEntryCost(TreeEntry *E);<br>
<br>
   /// This is the recursive part of buildTree.<br>
-  void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);<br>
+  void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int UserIndx = -1,<br>
+                     int OpdNum = 0);<br>
<br>
   /// \returns True if the ExtractElement/ExtractValue instructions in VL can<br>
   /// be vectorized to use the original vector (or aggregate "bitcast" to a vector).<br>
   bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;<br>
<br>
-  /// Vectorize a single entry in the tree.<br>
-  Value *vectorizeTree(TreeEntry *E);<br>
-<br>
-  /// Vectorize a single entry in the tree, starting in \p VL.<br>
-  Value *vectorizeTree(ArrayRef<Value *> VL);<br>
+  /// Vectorize a single entry in the tree. \p OpdNum indicates the ordinality<br>
+  /// of the operand corresponding to this tree entry \p E for the user tree<br>
+  /// entry indicated by \p UserIndx.<br>
+  //  In other words, "E == TreeEntry[UserIndx].getOperand(OpdNum)".<br>
+  Value *vectorizeTree(TreeEntry *E, int OpdNum = 0, int UserIndx = -1);<br>
+<br>
+  /// Vectorize a single entry in the tree, starting in \p VL. \p OpdNum<br>
+  /// indicates the ordinality of the operand corresponding to the \p VL of<br>
+  /// scalar values for the user, indicated by \p UserIndx, that this \p VL<br>
+  /// feeds into.<br>
+  Value *vectorizeTree(ArrayRef<Value *> VL, int OpdNum = 0, int UserIndx = -1);<br>
<br>
   /// \returns the pointer to the vectorized value if \p VL is already<br>
   /// vectorized, or NULL. They may happen in cycles.<br>
@@ -685,7 +691,7 @@ private:<br>
                                       SmallVectorImpl<Value *> &Left,<br>
                                       SmallVectorImpl<Value *> &Right);<br>
   struct TreeEntry {<br>
-    TreeEntry(std::vector<<wbr>TreeEntry> &Container) : Container(Container) {}<br>
+    TreeEntry(std::vector<<wbr>TreeEntry> &Container) : ShuffleMask(), Container(Container) {}<br>
<br>
     /// \returns true if the scalars in VL are equal to this entry.<br>
     bool isSame(ArrayRef<Value *> VL) const {<br>
@@ -693,6 +699,16 @@ private:<br>
       return std::equal(VL.begin(), VL.end(), Scalars.begin());<br>
     }<br>
<br>
+    /// \returns true if the scalars in VL are found in this tree entry.<br>
+    bool isFoundJumbled(ArrayRef<Value *> VL, const DataLayout &DL,<br>
+        ScalarEvolution &SE) const {<br>
+      assert(VL.size() == Scalars.size() && "Invalid size");<br>
+      SmallVector<Value *, 8> List;<br>
+      if (!sortLoadAccesses(VL, DL, SE, List))<br>
+        return false;<br>
+      return std::equal(List.begin(), List.end(), Scalars.begin());<br>
+    }<br>
+<br>
     /// A vector of scalars.<br>
     ValueList Scalars;<br>
<br>
@@ -702,6 +718,14 @@ private:<br>
     /// Do we need to gather this sequence ?<br>
     bool NeedToGather = false;<br>
<br>
+    /// Records optional shuffle mask for the uses of jumbled memory accesses.<br>
+    /// For example, a non-empty ShuffleMask[1] represents the permutation of<br>
+    /// lanes that operand #1 of this vectorized instruction should undergo<br>
+    /// before feeding this vectorized instruction, whereas an empty<br>
+    /// ShuffleMask[0] indicates that the lanes of operand #0 of this vectorized<br>
+    /// instruction need not be permuted at all.<br>
+    SmallVector<unsigned, 4> ShuffleMask[3];<br>
+<br>
     /// Points back to the VectorizableTree.<br>
     ///<br>
     /// Only used for Graphviz right now.  Unfortunately GraphTrait::NodeRef has<br>
@@ -717,12 +741,25 @@ private:<br>
<br>
   /// Create a new VectorizableTree entry.<br>
   TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,<br>
-                          int &UserTreeIdx) {<br>
+                          int &UserTreeIdx, const InstructionsState &S,<br>
+                          ArrayRef<unsigned> ShuffleMask = None,<br>
+                          int OpdNum = 0) {<br>
+    assert((!Vectorized || S.Opcode != 0) &&<br>
+           "Vectorized TreeEntry without opcode");<br>
     VectorizableTree.emplace_back(<wbr>VectorizableTree);<br>
+<br>
     int idx = VectorizableTree.size() - 1;<br>
     TreeEntry *Last = &VectorizableTree[idx];<br>
     Last->Scalars.insert(Last-><wbr>Scalars.begin(), VL.begin(), VL.end());<br>
     Last->NeedToGather = !Vectorized;<br>
+<br>
+    TreeEntry *UserEntry = &VectorizableTree[UserTreeIdx]<wbr>;<br>
+    if (!ShuffleMask.empty()) {<br>
+      assert(UserEntry->ShuffleMask[<wbr>OpdNum].empty() && "Mask already present!");<br>
+      UserEntry->ShuffleMask[OpdNum]<wbr>.insert(<br>
+          UserEntry->ShuffleMask[OpdNum]<wbr>.begin(), ShuffleMask.begin(),<br>
+          ShuffleMask.end());<br>
+    }<br>
     if (Vectorized) {<br>
       for (int i = 0, e = VL.size(); i != e; ++i) {<br>
         assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");<br>
@@ -1373,34 +1410,34 @@ void BoUpSLP::buildTree(ArrayRef<<wbr>Value *<br>
 }<br>
<br>
 void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Value *> VL, unsigned Depth,<br>
-                            int UserTreeIdx) {<br>
+                            int UserTreeIdx, int OpdNum) {<br>
   assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");<br>
<br>
   InstructionsState S = getSameOpcode(VL);<br>
   if (Depth == RecursionMaxDepth) {<br>
     DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");<br>
-    newTreeEntry(VL, false, UserTreeIdx);<br>
+    newTreeEntry(VL, false, UserTreeIdx, S);<br>
     return;<br>
   }<br>
<br>
   // Don't handle vectors.<br>
   if (S.OpValue->getType()-><wbr>isVectorTy()) {<br>
     DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");<br>
-    newTreeEntry(VL, false, UserTreeIdx);<br>
+    newTreeEntry(VL, false, UserTreeIdx, S);<br>
     return;<br>
   }<br>
<br>
   if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue)<wbr>)<br>
     if (SI->getValueOperand()-><wbr>getType()->isVectorTy()) {<br>
       DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       return;<br>
     }<br>
<br>
   // If all of the operands are identical or constant we have a simple solution.<br>
   if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !S.Opcode) {<br>
     DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");<br>
-    newTreeEntry(VL, false, UserTreeIdx);<br>
+    newTreeEntry(VL, false, UserTreeIdx, S);<br>
     return;<br>
   }<br>
<br>
@@ -1412,7 +1449,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
     if (EphValues.count(VL[i])) {<br>
       DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<<br>
             ") is ephemeral.\n");<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       return;<br>
     }<br>
   }<br>
@@ -1423,7 +1460,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
       DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");<br>
       if (E->Scalars[i] != VL[i]) {<br>
         DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");<br>
-        newTreeEntry(VL, false, UserTreeIdx);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
         return;<br>
       }<br>
     }<br>
@@ -1442,7 +1479,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
       if (getTreeEntry(I)) {<br>
       DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<<br>
             ") is already in tree.\n");<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       return;<br>
     }<br>
   }<br>
@@ -1452,7 +1489,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
   for (unsigned i = 0, e = VL.size(); i != e; ++i) {<br>
     if (MustGather.count(VL[i])) {<br>
       DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       return;<br>
     }<br>
   }<br>
@@ -1466,7 +1503,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
     // Don't go into unreachable blocks. They may contain instructions with<br>
     // dependency cycles which confuse the final scheduling.<br>
     DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");<br>
-    newTreeEntry(VL, false, UserTreeIdx);<br>
+    newTreeEntry(VL, false, UserTreeIdx, S);<br>
     return;<br>
   }<br>
<br>
@@ -1475,7 +1512,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
     for (unsigned j = i+1; j < e; ++j)<br>
       if (VL[i] == VL[j]) {<br>
         DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");<br>
-        newTreeEntry(VL, false, UserTreeIdx);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
         return;<br>
       }<br>
<br>
@@ -1490,7 +1527,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
     assert((!BS.getScheduleData(<wbr>VL0) ||<br>
             !BS.getScheduleData(VL0)-><wbr>isPartOfBundle()) &&<br>
            "tryScheduleBundle should cancelScheduling on failure");<br>
-    newTreeEntry(VL, false, UserTreeIdx);<br>
+    newTreeEntry(VL, false, UserTreeIdx, S);<br>
     return;<br>
   }<br>
   DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");<br>
@@ -1509,12 +1546,12 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
           if (Term) {<br>
             DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");<br>
             BS.cancelScheduling(VL, VL0);<br>
-            newTreeEntry(VL, false, UserTreeIdx);<br>
+            newTreeEntry(VL, false, UserTreeIdx, S);<br>
             return;<br>
           }<br>
         }<br>
<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");<br>
<br>
       for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {<br>
@@ -1524,7 +1561,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
           Operands.push_back(cast<<wbr>PHINode>(j)-><wbr>getIncomingValueForBlock(<br>
               PH->getIncomingBlock(i)));<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
     }<br>
@@ -1536,7 +1573,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
       } else {<br>
         BS.cancelScheduling(VL, VL0);<br>
       }<br>
-      newTreeEntry(VL, Reuse, UserTreeIdx);<br>
+      newTreeEntry(VL, Reuse, UserTreeIdx, S);<br>
       return;<br>
     }<br>
     case Instruction::Load: {<br>
@@ -1552,7 +1589,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
       if (DL->getTypeSizeInBits(<wbr>ScalarTy) !=<br>
           DL->getTypeAllocSizeInBits(<wbr>ScalarTy)) {<br>
         BS.cancelScheduling(VL, VL0);<br>
-        newTreeEntry(VL, false, UserTreeIdx);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
         DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");<br>
         return;<br>
       }<br>
@@ -1563,15 +1600,13 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         LoadInst *L = cast<LoadInst>(VL[i]);<br>
         if (!L->isSimple()) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");<br>
           return;<br>
         }<br>
       }<br>
<br>
       // Check if the loads are consecutive, reversed, or neither.<br>
-      // TODO: What we really want is to sort the loads, but for now, check<br>
-      // the two likely directions.<br>
       bool Consecutive = true;<br>
       bool ReverseConsecutive = true;<br>
       for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {<br>
@@ -1585,7 +1620,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
<br>
       if (Consecutive) {<br>
         ++NumLoadsWantToKeepOrder;<br>
-        newTreeEntry(VL, true, UserTreeIdx);<br>
+        newTreeEntry(VL, true, UserTreeIdx, S);<br>
         DEBUG(dbgs() << "SLP: added a vector of loads.\n");<br>
         return;<br>
       }<br>
@@ -1599,15 +1634,41 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
             break;<br>
           }<br>
<br>
-      BS.cancelScheduling(VL, VL0);<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
-<br>
       if (ReverseConsecutive) {<br>
-        ++NumLoadsWantToChangeOrder;<br>
         DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");<br>
-      } else {<br>
-        DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");<br>
+        ++NumLoadsWantToChangeOrder;<br>
+        BS.cancelScheduling(VL, VL0);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
+        return;<br>
+      }<br>
+<br>
+      if (VL.size() > 2) {<br>
+        bool ShuffledLoads = true;<br>
+        SmallVector<Value *, 8> Sorted;<br>
+        SmallVector<unsigned, 4> Mask;<br>
+        if (sortLoadAccesses(VL, *DL, *SE, Sorted, &Mask)) {<br>
+          auto NewVL = makeArrayRef(Sorted.begin(), Sorted.end());<br>
+          for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {<br>
+            if (!isConsecutiveAccess(NewVL[i]<wbr>, NewVL[i + 1], *DL, *SE)) {<br>
+              ShuffledLoads = false;<br>
+              break;<br>
+            }<br>
+          }<br>
+          // TODO: Tracking how many loads want to have an arbitrary shuffled<br>
+          // order would be useful.<br>
+          if (ShuffledLoads) {<br>
+            DEBUG(dbgs() << "SLP: added a vector of loads which needs "<br>
+                            "permutation of loaded lanes.\n");<br>
+            newTreeEntry(NewVL, true, UserTreeIdx, S,<br>
+                         makeArrayRef(Mask.begin(), Mask.end()), OpdNum);<br>
+            return;<br>
+          }<br>
+        }<br>
       }<br>
+<br>
+      DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");<br>
+      BS.cancelScheduling(VL, VL0);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       return;<br>
     }<br>
     case Instruction::ZExt:<br>
@@ -1627,12 +1688,12 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         Type *Ty = cast<Instruction>(VL[i])-><wbr>getOperand(0)->getType();<br>
         if (Ty != SrcTy || !isValidElementType(Ty)) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");<br>
           return;<br>
         }<br>
       }<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of casts.\n");<br>
<br>
       for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {<br>
@@ -1641,7 +1702,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         for (Value *j : VL)<br>
           Operands.push_back(cast<<wbr>Instruction>(j)->getOperand(i)<wbr>);<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
     }<br>
@@ -1655,13 +1716,13 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         if (Cmp->getPredicate() != P0 ||<br>
             Cmp->getOperand(0)->getType() != ComparedTy) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");<br>
           return;<br>
         }<br>
       }<br>
<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of compares.\n");<br>
<br>
       for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {<br>
@@ -1670,7 +1731,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         for (Value *j : VL)<br>
           Operands.push_back(cast<<wbr>Instruction>(j)->getOperand(i)<wbr>);<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
     }<br>
@@ -1693,7 +1754,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
     case Instruction::And:<br>
     case Instruction::Or:<br>
     case Instruction::Xor:<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of bin op.\n");<br>
<br>
       // Sort operands of the instructions so that each side is more likely to<br>
@@ -1702,7 +1763,7 @@ void BoUpSLP::buildTree_rec(<wbr>ArrayRef<Val<br>
         ValueList Left, Right;<br>
         reorderInputsAccordingToOpcode<wbr>(S.Opcode, VL, Left, Right);<br>
         buildTree_rec(Left, Depth + 1, UserTreeIdx);<br>
-        buildTree_rec(Right, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);<br>
         return;<br>
       }<br>
<br>
@@ -1712,7 +1773,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         for (Value *j : VL)<br>
           Operands.push_back(cast<Instruction>(j)->getOperand(i));<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
<br>
@@ -1722,7 +1783,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         if (cast<Instruction>(VL[j])->getNumOperands() != 2) {<br>
           DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           return;<br>
         }<br>
       }<br>
@@ -1735,7 +1796,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         if (Ty0 != CurTy) {<br>
           DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           return;<br>
         }<br>
       }<br>
@@ -1747,12 +1808,12 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
           DEBUG(<br>
               dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           return;<br>
         }<br>
       }<br>
<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");<br>
       for (unsigned i = 0, e = 2; i < e; ++i) {<br>
         ValueList Operands;<br>
@@ -1760,7 +1821,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         for (Value *j : VL)<br>
           Operands.push_back(cast<Instruction>(j)->getOperand(i));<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
     }<br>
@@ -1769,12 +1830,12 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
       for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)<br>
         if (!isConsecutiveAccess(VL[i], VL[i + 1], *DL, *SE)) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: Non-consecutive store.\n");<br>
           return;<br>
         }<br>
<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a vector of stores.\n");<br>
<br>
       ValueList Operands;<br>
@@ -1792,7 +1853,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
       Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);<br>
       if (!isTriviallyVectorizable(ID)) {<br>
         BS.cancelScheduling(VL, VL0);<br>
-        newTreeEntry(VL, false, UserTreeIdx);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
         DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");<br>
         return;<br>
       }<br>
@@ -1806,7 +1867,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
            getVectorIntrinsicIDForCall(CI2, TLI) != ID ||<br>
            !CI->hasIdenticalOperandBundleSchema(*CI2)) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: mismatched calls:" << *CI << "!=" << *VL[i]<br>
                        << "\n");<br>
           return;<br>
@@ -1817,7 +1878,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
           Value *A1J = CI2->getArgOperand(1);<br>
           if (A1I != A1J) {<br>
             BS.cancelScheduling(VL, VL0);<br>
-            newTreeEntry(VL, false, UserTreeIdx);<br>
+            newTreeEntry(VL, false, UserTreeIdx, S);<br>
             DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI<br>
                          << " argument "<< A1I<<"!=" << A1J<br>
                          << "\n");<br>
@@ -1830,14 +1891,14 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
                        CI->op_begin() + CI->getBundleOperandsEndIndex(),<br>
                        CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {<br>
           BS.cancelScheduling(VL, VL0);<br>
-          newTreeEntry(VL, false, UserTreeIdx);<br>
+          newTreeEntry(VL, false, UserTreeIdx, S);<br>
           DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="<br>
                        << *VL[i] << '\n');<br>
           return;<br>
         }<br>
       }<br>
<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {<br>
         ValueList Operands;<br>
         // Prepare the operand vector.<br>
@@ -1845,7 +1906,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
           CallInst *CI2 = dyn_cast<CallInst>(j);<br>
           Operands.push_back(CI2->getArgOperand(i));<br>
         }<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
     }<br>
@@ -1854,11 +1915,11 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
       // then do not vectorize this instruction.<br>
       if (!S.IsAltShuffle) {<br>
         BS.cancelScheduling(VL, VL0);<br>
-        newTreeEntry(VL, false, UserTreeIdx);<br>
+        newTreeEntry(VL, false, UserTreeIdx, S);<br>
         DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");<br>
         return;<br>
       }<br>
-      newTreeEntry(VL, true, UserTreeIdx);<br>
+      newTreeEntry(VL, true, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");<br>
<br>
       // Reorder operands if reordering would enable vectorization.<br>
@@ -1866,7 +1927,7 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         ValueList Left, Right;<br>
         reorderAltShuffleOperands(S.Opcode, VL, Left, Right);<br>
         buildTree_rec(Left, Depth + 1, UserTreeIdx);<br>
-        buildTree_rec(Right, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);<br>
         return;<br>
       }<br>
<br>
@@ -1876,13 +1937,13 @@ void BoUpSLP::buildTree_rec(ArrayRef<Val<br>
         for (Value *j : VL)<br>
           Operands.push_back(cast<Instruction>(j)->getOperand(i));<br>
<br>
-        buildTree_rec(Operands, Depth + 1, UserTreeIdx);<br>
+        buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);<br>
       }<br>
       return;<br>
<br>
     default:<br>
       BS.cancelScheduling(VL, VL0);<br>
-      newTreeEntry(VL, false, UserTreeIdx);<br>
+      newTreeEntry(VL, false, UserTreeIdx, S);<br>
       DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");<br>
       return;<br>
   }<br>
@@ -2720,12 +2781,15 @@ Value *BoUpSLP::alreadyVectorized(ArrayR<br>
   return nullptr;<br>
 }<br>
<br>
-Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {<br>
+Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, int OpdNum, int UserIndx) {<br>
   InstructionsState S = getSameOpcode(VL);<br>
   if (S.Opcode) {<br>
     if (TreeEntry *E = getTreeEntry(S.OpValue)) {<br>
-      if (E->isSame(VL))<br>
-        return vectorizeTree(E);<br>
+      TreeEntry *UserTreeEntry = &VectorizableTree[UserIndx];<br>
+      if (E->isSame(VL) ||<br>
+          (UserTreeEntry && !UserTreeEntry->ShuffleMask[OpdNum].empty() &&<br>
+           E->isFoundJumbled(VL, *DL, *SE)))<br>
+        return vectorizeTree(E, OpdNum, UserIndx);<br>
     }<br>
   }<br>
<br>
@@ -2737,9 +2801,11 @@ Value *BoUpSLP::vectorizeTree(ArrayRef<V<br>
   return Gather(VL, VecTy);<br>
 }<br>
<br>
-Value *BoUpSLP::vectorizeTree(TreeEntry *E) {<br>
+Value *BoUpSLP::vectorizeTree(TreeEntry *E, int OpdNum, int UserIndx) {<br>
   IRBuilder<>::InsertPointGuard Guard(Builder);<br>
<br>
+  int CurrIndx = ScalarToTreeEntry[E->Scalars[0]];<br>
+  TreeEntry *UserTreeEntry = nullptr;<br>
   if (E->VectorizedValue) {<br>
     DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");<br>
     return E->VectorizedValue;<br>
@@ -2788,7 +2854,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
        Builder.SetInsertPoint(IBB->getTerminator());<br>
        Builder.SetCurrentDebugLocation(PH->getDebugLoc());<br>
-        Value *Vec = vectorizeTree(Operands);<br>
+        Value *Vec = vectorizeTree(Operands, i, CurrIndx);<br>
         NewPhi->addIncoming(Vec, IBB);<br>
       }<br>
<br>
@@ -2841,7 +2907,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *InVec = vectorizeTree(INVL);<br>
+      Value *InVec = vectorizeTree(INVL, 0, CurrIndx);<br>
<br>
       if (Value *V = alreadyVectorized(E->Scalars, VL0))<br>
         return V;<br>
@@ -2862,8 +2928,8 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *L = vectorizeTree(LHSV);<br>
-      Value *R = vectorizeTree(RHSV);<br>
+      Value *L = vectorizeTree(LHSV, 0, CurrIndx);<br>
+      Value *R = vectorizeTree(RHSV, 1, CurrIndx);<br>
<br>
       if (Value *V = alreadyVectorized(E->Scalars, VL0))<br>
         return V;<br>
@@ -2890,9 +2956,9 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *Cond = vectorizeTree(CondVec);<br>
-      Value *True = vectorizeTree(TrueVec);<br>
-      Value *False = vectorizeTree(FalseVec);<br>
+      Value *Cond = vectorizeTree(CondVec, 0, CurrIndx);<br>
+      Value *True = vectorizeTree(TrueVec, 1, CurrIndx);<br>
+      Value *False = vectorizeTree(FalseVec, 2, CurrIndx);<br>
<br>
       if (Value *V = alreadyVectorized(E->Scalars, VL0))<br>
         return V;<br>
@@ -2933,8 +2999,8 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *LHS = vectorizeTree(LHSVL);<br>
-      Value *RHS = vectorizeTree(RHSVL);<br>
+      Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);<br>
+      Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);<br>
<br>
       if (Value *V = alreadyVectorized(E->Scalars, VL0))<br>
         return V;<br>
@@ -2955,7 +3021,17 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
       // sink them all the way down past store instructions.<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      LoadInst *LI = cast<LoadInst>(VL0);<br>
+      if(UserIndx != -1) {<br>
+        UserTreeEntry = &VectorizableTree[UserIndx];<br>
+      }<br>
+<br>
+      LoadInst *LI = NULL;<br>
+      if (UserTreeEntry && !UserTreeEntry->ShuffleMask[OpdNum].empty()) {<br>
+        LI = cast<LoadInst>(E->Scalars[0]);<br>
+      } else {<br>
+        LI = cast<LoadInst>(VL0);<br>
+      }<br>
+<br>
       Type *ScalarLoadTy = LI->getType();<br>
       unsigned AS = LI->getPointerAddressSpace();<br>
<br>
@@ -2977,7 +3053,24 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
       LI->setAlignment(Alignment);<br>
       E->VectorizedValue = LI;<br>
       ++NumVectorInstructions;<br>
-      return propagateMetadata(LI, E->Scalars);<br>
+      propagateMetadata(LI, E->Scalars);<br>
+<br>
+      if (UserTreeEntry && !UserTreeEntry->ShuffleMask[OpdNum].empty()) {<br>
+        SmallVector<Constant *, 8> Mask;<br>
+        for (unsigned Lane = 0, LE = UserTreeEntry->ShuffleMask[OpdNum].size();<br>
+             Lane != LE; ++Lane) {<br>
+          Mask.push_back(<br>
+              Builder.getInt32(UserTreeEntry->ShuffleMask[OpdNum][Lane]));<br>
+        }<br>
+        // Generate shuffle for jumbled memory access<br>
+        Value *Undef = UndefValue::get(VecTy);<br>
+        Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,<br>
+                                                  ConstantVector::get(Mask));<br>
+        E->VectorizedValue = Shuf;<br>
+        ++NumVectorInstructions;<br>
+        return Shuf;<br>
+      }<br>
+      return LI;<br>
     }<br>
     case Instruction::Store: {<br>
       StoreInst *SI = cast<StoreInst>(VL0);<br>
@@ -2990,7 +3083,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *VecValue = vectorizeTree(ScalarStoreValues);<br>
+      Value *VecValue = vectorizeTree(ScalarStoreValues, 0, CurrIndx);<br>
       Value *ScalarPtr = SI->getPointerOperand();<br>
       Value *VecPtr = Builder.CreateBitCast(<wbr>ScalarPtr, VecTy->getPointerTo(AS));<br>
       StoreInst *S = Builder.CreateStore(VecValue, VecPtr);<br>
@@ -3016,7 +3109,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
       for (Value *V : E->Scalars)<br>
         Op0VL.push_back(cast<GetElementPtrInst>(V)->getOperand(0));<br>
<br>
-      Value *Op0 = vectorizeTree(Op0VL);<br>
+      Value *Op0 = vectorizeTree(Op0VL, 0, CurrIndx);<br>
<br>
       std::vector<Value *> OpVecs;<br>
      for (int j = 1, e = cast<GetElementPtrInst>(VL0)->getNumOperands(); j < e;<br>
@@ -3025,7 +3118,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
         for (Value *V : E->Scalars)<br>
           OpVL.push_back(cast<GetElementPtrInst>(V)->getOperand(j));<br>
<br>
-        Value *OpVec = vectorizeTree(OpVL);<br>
+        Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);<br>
         OpVecs.push_back(OpVec);<br>
       }<br>
<br>
@@ -3064,7 +3157,7 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
          OpVL.push_back(CEI->getArgOperand(j));<br>
         }<br>
<br>
-        Value *OpVec = vectorizeTree(OpVL);<br>
+        Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);<br>
         DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");<br>
         OpVecs.push_back(OpVec);<br>
       }<br>
@@ -3095,8 +3188,8 @@ Value *BoUpSLP::vectorizeTree(TreeEntry<br>
       reorderAltShuffleOperands(S.Opcode, E->Scalars, LHSVL, RHSVL);<br>
       setInsertPointAfterBundle(E->Scalars, VL0);<br>
<br>
-      Value *LHS = vectorizeTree(LHSVL);<br>
-      Value *RHS = vectorizeTree(RHSVL);<br>
+      Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);<br>
+      Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);<br>
<br>
       if (Value *V = alreadyVectorized(E->Scalars, VL0))<br>
         return V;<br>
@@ -3198,7 +3291,13 @@ BoUpSLP::vectorizeTree(ExtraValueToDebug<br>
     assert(E && "Invalid scalar");<br>
     assert(!E->NeedToGather && "Extracting from a gather list");<br>
<br>
-    Value *Vec = E->VectorizedValue;<br>
+    Value *Vec = nullptr;<br>
+    if ((Vec = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) &&<br>
+        dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {<br>
+      Vec = cast<Instruction>(E->VectorizedValue)->getOperand(0);<br>
+    } else {<br>
+      Vec = E->VectorizedValue;<br>
+    }<br>
     assert(Vec && "Can't find vectorizable value");<br>
<br>
    Value *Lane = Builder.getInt32(ExternalUse.Lane);<br>
<br>
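The hunks above record, per operand, a shuffle mask for loads that are consecutive in memory but appear "jumbled" relative to the tree's scalar order, and then emit one wide load followed by a shufflevector. As a minimal sketch of the mask-derivation idea (the helper name and types here are illustrative, not the patch's API):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Given the memory offsets (relative to the lowest address) of each scalar
// load in tree order, decide whether the accesses form a consecutive,
// gap-free run and, if so, return the shuffle mask that reorders the wide
// load back into tree order: lane L of the shuffled vector must hold the
// scalar at offset Offsets[L], which sits in lane Offsets[L] of the load.
static bool computeJumbleMask(const std::vector<uint32_t> &Offsets,
                              std::vector<uint32_t> &Mask) {
  std::vector<uint32_t> Sorted(Offsets);
  std::sort(Sorted.begin(), Sorted.end());
  // Sorted offsets must be exactly 0, 1, ..., N-1 for one contiguous load.
  for (uint32_t i = 0; i < Sorted.size(); ++i)
    if (Sorted[i] != i)
      return false;
  Mask = Offsets; // Tree-order offsets double as the shuffle mask.
  return true;
}
```

For example, the jumbled-load.ll test below reads offsets {1, 3, 2, 0} of `%in` in tree order, which yields the mask `<i32 1, i32 3, i32 2, i32 0>` seen in its CHECK lines; non-contiguous offsets such as {0, 2, 4, 6} would be rejected.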
Modified: llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll?rev=313736&r1=313735&r2=313736&view=diff</a><br>
==============================================================================<br>
--- llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll (original)<br>
+++ llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll Wed Sep 20 01:18:28 2017<br>
@@ -11,20 +11,16 @@<br>
     define i32 @fn1() {<br>
 ; CHECK-LABEL: @fn1(<br>
 ; CHECK-NEXT:  entry:<br>
-; CHECK-NEXT:    [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4<br>
-; CHECK-NEXT:    [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4<br>
-; CHECK-NEXT:    [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 2), align 4<br>
-; CHECK-NEXT:    [[TMP3:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4<br>
-; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0<br>
-; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP2]], i32 1<br>
-; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP3]], i32 2<br>
-; CHECK-NEXT:    [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0]], i32 3<br>
-; CHECK-NEXT:    [[TMP8:%.*]] = icmp sgt <4 x i32> [[TMP7]], zeroinitializer<br>
-; CHECK-NEXT:    [[TMP9:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1<br>
-; CHECK-NEXT:    [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2<br>
-; CHECK-NEXT:    [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 8, i32 3<br>
-; CHECK-NEXT:    [[TMP12:%.*]] = select <4 x i1> [[TMP8]], <4 x i32> [[TMP11]], <4 x i32> <i32 6, i32 0, i32 0, i32 0><br>
-; CHECK-NEXT:    store <4 x i32> [[TMP12]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4<br>
+; CHECK-NEXT:    [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4<br>
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0><br>
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer<br>
+; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1<br>
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0<br>
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1<br>
+; CHECK-NEXT:    [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2<br>
+; CHECK-NEXT:    [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3<br>
+; CHECK-NEXT:    [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0><br>
+; CHECK-NEXT:    store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4<br>
 ; CHECK-NEXT:    ret i32 0<br>
 ;<br>
   entry:<br>
<br>
Added: llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll?rev=313736&view=auto" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll?rev=313736&view=auto</a><br>
==============================================================================<br>
--- llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll (added)<br>
+++ llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll Wed Sep 20 01:18:28 2017<br>
@@ -0,0 +1,68 @@<br>
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py<br>
+; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s<br>
+<br>
+<br>
+;void jumble (int * restrict A, int * restrict B) {<br>
+  ;  int tmp0 = A[10]*A[0];<br>
+  ;  int tmp1 = A[11]*A[1];<br>
+  ;  int tmp2 = A[12]*A[3];<br>
+  ;  int tmp3 = A[13]*A[2];<br>
+  ;  B[0] = tmp0;<br>
+  ;  B[1] = tmp1;<br>
+  ;  B[2] = tmp2;<br>
+  ;  B[3] = tmp3;<br>
+  ;}<br>
+  ; Function Attrs: norecurse nounwind uwtable<br>
+  define void @jumble(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {<br>
+; CHECK-LABEL: @jumble(<br>
+; CHECK-NEXT:  entry:<br>
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10<br>
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11<br>
+; CHECK-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1<br>
+; CHECK-NEXT:    [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12<br>
+; CHECK-NEXT:    [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3<br>
+; CHECK-NEXT:    [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13<br>
+; CHECK-NEXT:    [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4<br>
+; CHECK-NEXT:    [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2<br>
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4<br>
+; CHECK-NEXT:    [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2><br>
+; CHECK-NEXT:    [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP1]]<br>
+; CHECK-NEXT:    [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1<br>
+; CHECK-NEXT:    [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2<br>
+; CHECK-NEXT:    [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3<br>
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*<br>
+; CHECK-NEXT:    store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4<br>
+; CHECK-NEXT:    ret void<br>
+;<br>
+entry:<br>
+  %arrayidx = getelementptr inbounds i32, i32* %A, i64 10<br>
+  %0 = load i32, i32* %arrayidx, align 4<br>
+  %1 = load i32, i32* %A, align 4<br>
+  %mul = mul nsw i32 %1, %0<br>
+  %arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11<br>
+  %2 = load i32, i32* %arrayidx2, align 4<br>
+  %arrayidx3 = getelementptr inbounds i32, i32* %A, i64 1<br>
+  %3 = load i32, i32* %arrayidx3, align 4<br>
+  %mul4 = mul nsw i32 %3, %2<br>
+  %arrayidx5 = getelementptr inbounds i32, i32* %A, i64 12<br>
+  %4 = load i32, i32* %arrayidx5, align 4<br>
+  %arrayidx6 = getelementptr inbounds i32, i32* %A, i64 3<br>
+  %5 = load i32, i32* %arrayidx6, align 4<br>
+  %mul7 = mul nsw i32 %5, %4<br>
+  %arrayidx8 = getelementptr inbounds i32, i32* %A, i64 13<br>
+  %6 = load i32, i32* %arrayidx8, align 4<br>
+  %arrayidx9 = getelementptr inbounds i32, i32* %A, i64 2<br>
+  %7 = load i32, i32* %arrayidx9, align 4<br>
+  %mul10 = mul nsw i32 %7, %6<br>
+  store i32 %mul, i32* %B, align 4<br>
+  %arrayidx12 = getelementptr inbounds i32, i32* %B, i64 1<br>
+  store i32 %mul4, i32* %arrayidx12, align 4<br>
+  %arrayidx13 = getelementptr inbounds i32, i32* %B, i64 2<br>
+  store i32 %mul7, i32* %arrayidx13, align 4<br>
+  %arrayidx14 = getelementptr inbounds i32, i32* %B, i64 3<br>
+  store i32 %mul10, i32* %arrayidx14, align 4<br>
+  ret void<br>
+  }<br>
+<br>
<br>
Modified: llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll?rev=313736&r1=313735&r2=313736&view=diff</a><br>
==============================================================================<br>
--- llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll (original)<br>
+++ llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll Wed Sep 20 01:18:28 2017<br>
@@ -5,34 +5,27 @@<br>
<br>
 define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {<br>
 ; CHECK-LABEL: @jumbled-load(<br>
-; CHECK-NEXT:    [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0<br>
-; CHECK-NEXT:    [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4<br>
+; CHECK-NEXT:    [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0<br>
 ; CHECK-NEXT:    [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3<br>
-; CHECK-NEXT:    [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4<br>
 ; CHECK-NEXT:    [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1<br>
-; CHECK-NEXT:    [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4<br>
 ; CHECK-NEXT:    [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2<br>
-; CHECK-NEXT:    [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4<br>
-; CHECK-NEXT:    [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0<br>
-; CHECK-NEXT:    [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4<br>
+; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4<br>
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0><br>
+; CHECK-NEXT:    [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0<br>
 ; CHECK-NEXT:    [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2<br>
-; CHECK-NEXT:    [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4<br>
 ; CHECK-NEXT:    [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3<br>
-; CHECK-NEXT:    [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4<br>
 ; CHECK-NEXT:    [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1<br>
-; CHECK-NEXT:    [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4<br>
-; CHECK-NEXT:    [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]<br>
-; CHECK-NEXT:    [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]<br>
-; CHECK-NEXT:    [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]<br>
-; CHECK-NEXT:    [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]<br>
-; CHECK-NEXT:    [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0<br>
-; CHECK-NEXT:    store i32 [[MUL_1]], i32* [[GEP_7]], align 4<br>
-; CHECK-NEXT:    [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1<br>
-; CHECK-NEXT:    store i32 [[MUL_2]], i32* [[GEP_8]], align 4<br>
-; CHECK-NEXT:    [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2<br>
-; CHECK-NEXT:    store i32 [[MUL_3]], i32* [[GEP_9]], align 4<br>
-; CHECK-NEXT:    [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3<br>
-; CHECK-NEXT:    store i32 [[MUL_4]], i32* [[GEP_10]], align 4<br>
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4<br>
+; CHECK-NEXT:    [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2><br>
+; CHECK-NEXT:    [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]<br>
+; CHECK-NEXT:    [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0<br>
+; CHECK-NEXT:    [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1<br>
+; CHECK-NEXT:    [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2<br>
+; CHECK-NEXT:    [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3<br>
+; CHECK-NEXT:    [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*<br>
+; CHECK-NEXT:    store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4<br>
 ; CHECK-NEXT:    ret i32 undef<br>
 ;<br>
   %in.addr = getelementptr inbounds i32, i32* %in, i64 0<br>
<br>
Modified: llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll<br>
URL: <a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll?rev=313736&r1=313735&r2=313736&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll?rev=313736&r1=313735&r2=313736&view=diff</a><br>
==============================================================================<br>
--- llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll (original)<br>
+++ llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll Wed Sep 20 01:18:28 2017<br>
@@ -6,33 +6,26 @@<br>
 define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {<br>
 ; CHECK-LABEL: @jumbled-load(<br>
 ; CHECK-NEXT:    [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0<br>
-; CHECK-NEXT:    [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4<br>
 ; CHECK-NEXT:    [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1<br>
-; CHECK-NEXT:    [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4<br>
 ; CHECK-NEXT:    [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2<br>
-; CHECK-NEXT:    [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4<br>
 ; CHECK-NEXT:    [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3<br>
-; CHECK-NEXT:    [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4<br>
+; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4<br>
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2><br>
 ; CHECK-NEXT:    [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0<br>
-; CHECK-NEXT:    [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4<br>
 ; CHECK-NEXT:    [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1<br>
-; CHECK-NEXT:    [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4<br>
 ; CHECK-NEXT:    [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2<br>
-; CHECK-NEXT:    [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4<br>
 ; CHECK-NEXT:    [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3<br>
-; CHECK-NEXT:    [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4<br>
-; CHECK-NEXT:    [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]<br>
-; CHECK-NEXT:    [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]<br>
-; CHECK-NEXT:    [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]<br>
-; CHECK-NEXT:    [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]<br>
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*<br>
+; CHECK-NEXT:    [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4<br>
+; CHECK-NEXT:    [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2><br>
+; CHECK-NEXT:    [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]<br>
 ; CHECK-NEXT:    [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0<br>
 ; CHECK-NEXT:    [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1<br>
 ; CHECK-NEXT:    [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2<br>
 ; CHECK-NEXT:    [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3<br>
-; CHECK-NEXT:    store i32 [[MUL_1]], i32* [[GEP_9]], align 4<br>
-; CHECK-NEXT:    store i32 [[MUL_2]], i32* [[GEP_7]], align 4<br>
-; CHECK-NEXT:    store i32 [[MUL_3]], i32* [[GEP_10]], align 4<br>
-; CHECK-NEXT:    store i32 [[MUL_4]], i32* [[GEP_8]], align 4<br>
+; CHECK-NEXT:    [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*<br>
+; CHECK-NEXT:    store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4<br>
 ; CHECK-NEXT:    ret i32 undef<br>
 ;<br>
   %in.addr = getelementptr inbounds i32, i32* %in, i64 0<br>
<br>
<br>
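The shufflevector emitted after each wide load in the tests above has simple semantics: output lane L takes input lane Mask[L]. A small scalar simulation of that step, assuming illustrative names (not part of the patch):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Simulate the single-input shufflevector the vectorizer emits after a
// jumbled load: Out[L] = In[Mask[L]]. With store-jumbled.ll's mask
// <1, 3, 0, 2>, a wide load of A[0..3] is reordered into the lane order
// the vectorized multiply expects.
template <std::size_t N>
std::array<int32_t, N> applyShuffle(const std::array<int32_t, N> &In,
                                    const std::array<uint32_t, N> &Mask) {
  std::array<int32_t, N> Out{};
  for (std::size_t L = 0; L < N; ++L)
    Out[L] = In[Mask[L]]; // select source lane per the recorded mask
  return Out;
}
```

For instance, if the wide load yields {10, 20, 30, 40}, mask {1, 3, 0, 2} produces {20, 40, 10, 30}.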
_______________________________________________<br>
llvm-commits mailing list<br>
<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>
</blockquote></div><br></div>