<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Tue, Dec 13, 2016 at 6:31 AM Artur Pilipenko via llvm-commits <<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Author: apilipenko<br class="gmail_msg">
Date: Tue Dec 13 08:21:14 2016<br class="gmail_msg">
New Revision: 289538<br class="gmail_msg">
<br class="gmail_msg">
URL: <a href="http://llvm.org/viewvc/llvm-project?rev=289538&view=rev" rel="noreferrer" class="gmail_msg" target="_blank">http://llvm.org/viewvc/llvm-project?rev=289538&view=rev</a><br class="gmail_msg">
Log:<br class="gmail_msg">
[DAGCombiner] Match load by bytes idiom and fold it into a single load<br class="gmail_msg"></blockquote><div><br></div><div>Really cool transform, but unfortunately, this patch has a bad bug in it that is a bit tricky to trigger. Consider the test case:<br></div><div><br></div><div><div>target triple = "x86_64-unknown-linux-gnu"</div><div><br></div><div>define void @crash(i8* %src1, i8* %src2, i64* %dst) {</div><div>entry:</div><div> %load1 = load i8, i8* %src1, align 1</div><div> %conv46 = zext i8 %load1 to i32</div><div> %shl47 = shl i32 %conv46, 56</div><div> %or55 = or i32 %shl47, 0</div><div> %load2 = load i8, i8* %src2, align 1</div><div> %conv57 = zext i8 %load2 to i32</div><div> %shl58 = shl i32 %conv57, 32</div><div> %or59 = or i32 %or55, %shl58</div><div> %or74 = or i32 %or59, 0</div><div> %conv75 = sext i32 %or74 to i64</div><div> store i64 %conv75, i64* %dst, align 8</div><div> ret void</div><div>}</div></div><div><br></div><div>This will crash when it tries to back and end iterator past the beginning of a vector. You need to check for bad shift widths.</div><div><br></div><div><br></div><div>However, there are several other serious problems with the patch that need to be addressed so I'm going to revert it for now to unblock folks (we're hitting this in real code) and let you post an updated patch for review. Please get the updated patch re-reviewed, as I think this needs some more detailed review around data structures and algorithms used. See my (very rough) comments below...<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">==============================================================================<br class="gmail_msg">
--- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)<br class="gmail_msg">
+++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Dec 13 08:21:14 2016<br class="gmail_msg">
@@ -20,6 +20,7 @@<br class="gmail_msg">
#include "llvm/ADT/SetVector.h"<br class="gmail_msg">
#include "llvm/ADT/SmallBitVector.h"<br class="gmail_msg">
#include "llvm/ADT/SmallPtrSet.h"<br class="gmail_msg">
+#include "llvm/ADT/SmallSet.h"<br class="gmail_msg">
#include "llvm/ADT/Statistic.h"<br class="gmail_msg">
#include "llvm/Analysis/AliasAnalysis.h"<br class="gmail_msg">
#include "llvm/CodeGen/MachineFrameInfo.h"<br class="gmail_msg">
@@ -375,6 +376,7 @@ namespace {<br class="gmail_msg">
unsigned PosOpcode, unsigned NegOpcode,<br class="gmail_msg">
const SDLoc &DL);<br class="gmail_msg">
SDNode *MatchRotate(SDValue LHS, SDValue RHS, const SDLoc &DL);<br class="gmail_msg">
+ SDValue MatchLoadCombine(SDNode *N);<br class="gmail_msg">
SDValue ReduceLoadWidth(SDNode *N);<br class="gmail_msg">
SDValue ReduceLoadOpStoreWidth(SDNode *N);<br class="gmail_msg">
SDValue splitMergedValStore(StoreSDNode *ST);<br class="gmail_msg">
@@ -3969,6 +3971,9 @@ SDValue DAGCombiner::visitOR(SDNode *N)<br class="gmail_msg">
if (SDNode *Rot = MatchRotate(N0, N1, SDLoc(N)))<br class="gmail_msg">
return SDValue(Rot, 0);<br class="gmail_msg">
<br class="gmail_msg">
+ if (SDValue Load = MatchLoadCombine(N))<br class="gmail_msg">
+ return Load;<br class="gmail_msg">
+<br class="gmail_msg">
// Simplify the operands using demanded-bits information.<br class="gmail_msg">
if (!VT.isVector() &&<br class="gmail_msg">
SimplifyDemandedBits(SDValue(N, 0)))<br class="gmail_msg">
@@ -4340,6 +4345,277 @@ struct BaseIndexOffset {<br class="gmail_msg">
};<br class="gmail_msg">
} // namespace<br class="gmail_msg">
<br class="gmail_msg">
+namespace {<br class="gmail_msg">
+/// Represents the origin of an individual byte in load combine pattern. The<br class="gmail_msg">
+/// value of the byte is either unknown, zero or comes from memory.<br class="gmail_msg">
+struct ByteProvider {<br class="gmail_msg">
+ enum ProviderTy {<br class="gmail_msg">
+ Unknown,<br class="gmail_msg">
+ ZeroConstant,<br class="gmail_msg">
+ Memory<br class="gmail_msg">
+ };<br class="gmail_msg">
+<br class="gmail_msg">
+ ProviderTy Kind;<br class="gmail_msg"></blockquote><div><br></div><div>Putting this before a pointer means you'll have almost all padding bits here.. =[</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+ // Load and ByteOffset are set for Memory providers only.<br class="gmail_msg">
+ // Load represents the node which loads the byte from memory.<br class="gmail_msg">
+ // ByteOffset is the offset of the byte in the value produced by the load.<br class="gmail_msg">
+ LoadSDNode *Load;<br class="gmail_msg">
+ unsigned ByteOffset;<br class="gmail_msg"></blockquote><div><br></div><div>Most of the providers are not loads though? This data structure seems really inefficient for the common case.</div><div><br></div><div>Also, ByteOffset seems completely redundant. It is always set to the same value as the index into the vector below. Can you eliminate it or avoid it somehow?</div><div><br></div><div>Then, could you use a PointerIntPair so that this whole thing becomes the size of a pointer? That would seem much more efficient.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">+/// Recursively traverses the expression collecting the origin of individual<br class="gmail_msg">
+/// bytes of the given value. For all the values except the root of the<br class="gmail_msg">
+/// expression verifies that it doesn't have uses outside of the expression.<br class="gmail_msg">
+const Optional<SmallVector<ByteProvider, 4> ><br class="gmail_msg"></blockquote><div><br></div><div>Please use clang-format with the latest LLVM settings. We don't need the space before the closing '>' any more.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+collectByteProviders(SDValue Op, bool CheckNumberOfUses = false) {<br class="gmail_msg">
+ if (CheckNumberOfUses && !Op.hasOneUse())<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+<br class="gmail_msg">
+ unsigned BitWidth = Op.getScalarValueSizeInBits();<br class="gmail_msg">
+ if (BitWidth % 8 != 0)<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+ unsigned ByteWidth = BitWidth / 8;<br class="gmail_msg"></blockquote><div><br></div><div>Do you want to bound the size you consider here? Especially as this is in the backend you should know the maximal size you can fuse together... But maybe if you make this more efficient you can handle the full width of any integer.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+<br class="gmail_msg">
+ switch (Op.getOpcode()) {<br class="gmail_msg">
+ case ISD::OR: {<br class="gmail_msg">
+ auto LHS = collectByteProviders(Op->getOperand(0),<br class="gmail_msg">
+ /*CheckNumberOfUses=*/true);<br class="gmail_msg">
+ auto RHS = collectByteProviders(Op->getOperand(1),<br class="gmail_msg">
+ /*CheckNumberOfUses=*/true);<br class="gmail_msg">
+ if (!LHS || !RHS)<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+<br class="gmail_msg">
+ auto OR = [](ByteProvider LHS, ByteProvider RHS) {<br class="gmail_msg">
+ if (LHS == RHS)<br class="gmail_msg">
+ return LHS;<br class="gmail_msg">
+ if (LHS.Kind == ByteProvider::Unknown ||<br class="gmail_msg">
+ RHS.Kind == ByteProvider::Unknown)<br class="gmail_msg">
+ return ByteProvider::getUnknown();<br class="gmail_msg">
+ if (LHS.Kind == ByteProvider::Memory && RHS.Kind == ByteProvider::Memory)<br class="gmail_msg">
+ return ByteProvider::getUnknown(); </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+<br class="gmail_msg">
+ if (LHS.Kind == ByteProvider::Memory)<br class="gmail_msg">
+ return LHS;<br class="gmail_msg">
+ else<br class="gmail_msg">
+ return RHS;<br class="gmail_msg">
+ };<br class="gmail_msg">
+<br class="gmail_msg">
+ SmallVector<ByteProvider, 4> Result(ByteWidth);<br class="gmail_msg">
+ for (unsigned i = 0; i < LHS->size(); i++)<br class="gmail_msg">
+ Result[i] = OR(LHS.getValue()[i], RHS.getValue()[i]);<br class="gmail_msg"></blockquote><div><br></div><div>This is really inefficient.</div><div><br></div><div>You build the entire tree for each layer even if you throw it away again. And because you recurse infinitely upwards, it is quadratic for tall chains.</div><div><br></div><div>For combines that recurse upward to not be quadratic in nature, they must:</div><div>- Combine away the nodes they recurse upward through so we don't re-visit them, or</div><div>- Have a constant factor limit on the depth of recursion</div><div><br></div><div>And to make matters worse when you have a load of more than one byte, you build the information for the load redundantly, once for each byte.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
+<br class="gmail_msg">
+ return Result;<br class="gmail_msg">
+ }<br class="gmail_msg">
+ case ISD::SHL: {<br class="gmail_msg">
+ auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));<br class="gmail_msg">
+ if (!ShiftOp)<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+<br class="gmail_msg">
+ uint64_t BitShift = ShiftOp->getZExtValue();<br class="gmail_msg">
+ if (BitShift % 8 != 0)<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+ uint64_t ByteShift = BitShift / 8;<br class="gmail_msg">
+<br class="gmail_msg">
+ auto Original = collectByteProviders(Op->getOperand(0),<br class="gmail_msg">
+ /*CheckNumberOfUses=*/true);<br class="gmail_msg">
+ if (!Original)<br class="gmail_msg">
+ return None;<br class="gmail_msg">
+<br class="gmail_msg">
+ SmallVector<ByteProvider, 4> Result;<br class="gmail_msg">
+ Result.insert(Result.begin(), ByteShift, ByteProvider::getZero());<br class="gmail_msg">
+ Result.insert(Result.end(), Original->begin(),<br class="gmail_msg">
+ std::prev(Original->end(), ByteShift));<br class="gmail_msg">
+ assert(Result.size() == ByteWidth && "sanity");<br class="gmail_msg">
+ return Result;<br class="gmail_msg"></blockquote><div><br></div><div>This is again really inefficient.</div><div><br></div><div>First, you build the full Original set of things, but then throw away the tail. And second you are copying things from small vector to small vector everywhere here.</div><div><br></div><div>This really feel like it should be linearly filling in the provider for each byte of the bytes that survived to be used, with no copying or other movement. That way you can only recurse through the byte providers you actually care about, etc. Does that make sense?</div><div><br></div><div>Or maybe some other algorithm. But fundamentally, I think this recursive walk will need to be significantly re-worked.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">+ // Calculate byte providers for the OR we are looking at<br class="gmail_msg">
+ auto Res = collectByteProviders(SDValue(N, 0));<br class="gmail_msg">
+ if (!Res)<br class="gmail_msg">
+ return SDValue();<br class="gmail_msg">
+ auto &Bytes = Res.getValue();<br class="gmail_msg">
+ unsigned ByteWidth = Bytes.size();<br class="gmail_msg">
+ assert(VT.getSizeInBits() == ByteWidth * 8 && "sanity");<br class="gmail_msg">
+<br class="gmail_msg">
+ auto LittleEndianByteAt = [](unsigned BW, unsigned i) { return i; };<br class="gmail_msg">
+ auto BigEndianByteAt = [](unsigned BW, unsigned i) { return BW - i - 1; };<br class="gmail_msg">
+<br class="gmail_msg">
+ Optional<BaseIndexOffset> Base;<br class="gmail_msg">
+ SDValue Chain;<br class="gmail_msg">
+<br class="gmail_msg">
+ SmallSet<LoadSDNode *, 8> Loads;<br class="gmail_msg">
+ LoadSDNode *FirstLoad = nullptr;<br class="gmail_msg">
+<br class="gmail_msg">
+ // Check if all the bytes of the OR we are looking at are loaded from the same<br class="gmail_msg">
+ // base address. Collect bytes offsets from Base address in ByteOffsets.<br class="gmail_msg">
+ SmallVector<int64_t, 4> ByteOffsets(ByteWidth);<br class="gmail_msg">
+ for (unsigned i = 0; i < ByteWidth; i++) {<br class="gmail_msg">
+ // All the bytes must be loaded from memory<br class="gmail_msg">
+ if (Bytes[i].Kind != ByteProvider::Memory)<br class="gmail_msg">
+ return SDValue();<br class="gmail_msg"></blockquote><div><br></div><div>This again seems especially wasteful. Above you carefully built a structure tracking which bytes were zeroed and which came from memory, but you don't need all of that. You just need the bytes from memory and you could early exit if they aren't from memory.</div><div><br></div><div>Alternatively, you could implement something that does the full load and mask off the things that need to be zero. But that would be a more complex cost model and tradeoff space so it might not be worth it.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">==============================================================================<br class="gmail_msg">
--- llvm/trunk/test/CodeGen/ARM/load-combine-big-endian.ll (added)<br class="gmail_msg">
+++ llvm/trunk/test/CodeGen/ARM/load-combine-big-endian.ll Tue Dec 13 08:21:14 2016<br class="gmail_msg">
@@ -0,0 +1,234 @@<br class="gmail_msg">
+; RUN: llc < %s -mtriple=armeb-unknown | FileCheck %s<br class="gmail_msg">
+; RUN: llc < %s -mtriple=arm64eb-unknown | FileCheck %s --check-prefix=CHECK64<br class="gmail_msg">
+<br class="gmail_msg">
+; i8* p; // p is 4 byte aligned<br class="gmail_msg">
+; ((i32) p[0] << 24) | ((i32) p[1] << 16) | ((i32) p[2] << 8) | (i32) p[3]<br class="gmail_msg">
+define i32 @load_i32_by_i8_big_endian(i32*) {<br class="gmail_msg">
+; CHECK-LABEL: load_i32_by_i8_big_endian:<br class="gmail_msg">
+; CHECK: ldr r0, [r0]<br class="gmail_msg">
+; CHECK-NEXT: mov pc, lr<br class="gmail_msg">
+<br class="gmail_msg">
+; CHECK64-LABEL: load_i32_by_i8_big_endian:<br class="gmail_msg">
+; CHECK64: ldr w0, [x0]<br class="gmail_msg">
+; CHECK64-NEXT: ret<br class="gmail_msg">
+ %2 = bitcast i32* %0 to i8*<br class="gmail_msg">
+ %3 = load i8, i8* %2, align 4<br class="gmail_msg"></blockquote><div><br></div><div>Please *always* use named values in your tests. otherwise making minor updates is incredibly painful as you have to renumber all subsequent values.</div></div></div>