[PATCH] D21496: LoadCombine Load Aliasing Fix

Sat Jun 18 19:25:25 PDT 2016

eli.friedman added inline comments.

================
Comment at: /Users/rriddle/Desktop/llvm/llvm/test/Transforms/LoadCombine/load-combine-alias.ll:12
@@ +11,3 @@
+  %i1 = getelementptr inbounds %struct.Str, %struct.Str* %1, i64 0, i32 0
+  store i32 %0, i32* %i1, align 4
+  %add.ptr = getelementptr inbounds i32, i32* %i, i64 1
----------------
rriddle wrote:
> rriddle wrote:
> > eli.friedman wrote:
> > > rriddle wrote:
> > > > eli.friedman wrote:
> > > > > What transform is this testcase checking for?  I don't think you can prove any useful aliasing property between %i, %i1, and %str?
> > > > In the current implementation %1 is being aliased in that store operation which would cause every load chain to try and combine which would prevent the load at %0 from being able to combine with %2, %4, etc.
> > > > 
> > > I'm not sure I follow... is the transform you're trying to perform something like this?
> > > 
> > > ```
> > > void f(int* a, int* b) {
> > >   a[0] = b[0];
> > >   a[1] = b[1];
> > >   a[2] = b[2];
> > >   a[3] = b[3];
> > >   // Combine to "memcpy(a, b, 16)"???
> > > }
> > > ```
> > No, the point of transform is to combine each of the individual loads into a single large one. In this specific test, there are multiple GEP + load pairs coming from %i. Each of those will combine down to a single i128 load. It may be easier to see with this output 
> > 
> > ```
> > define i32 @Load_MultiChain(i32* %i) {
> >   %1 = getelementptr inbounds i32, i32* %i, i64 1
> >   %2 = bitcast i32* %i to i8*
> >   %3 = getelementptr i8, i8* %2, i64 0
> >   %4 = bitcast i8* %3 to i128*
> >   %.combined = load i128, i128* %4, align 4
> >   %combine.extract.shift = lshr i128 %.combined, 32
> >   %combine.extract.trunc1 = trunc i128 %combine.extract.shift to i32
> >   %5 = load i32, i32* %1, align 4
> >   %combine.extract.trunc = trunc i128 %.combined to i32
> >   %6 = load i32, i32* %i, align 4
> >   %7 = getelementptr inbounds i32, i32* %i, i64 2
> >   %combine.extract.shift2 = lshr i128 %.combined, 64
> >   %combine.extract.trunc3 = trunc i128 %combine.extract.shift2 to i32
> >   %8 = load i32, i32* %7, align 4
> >   %9 = getelementptr inbounds i32, i32* %i, i64 3
> >   %combine.extract.shift4 = lshr i128 %.combined, 96
> >   %combine.extract.trunc5 = trunc i128 %combine.extract.shift4 to i32
> >   %10 = load i32, i32* %9, align 4
> >   %11 = getelementptr inbounds i32, i32* %i, i64 4
> >   %12 = load i32, i32* %11, align 4
> >   %13 = getelementptr inbounds i32, i32* %i, i64 5
> >   %14 = load i32, i32* %13, align 4
> >   %15 = add nsw i32 %combine.extract.trunc, %14
> >   ret i32 %15
> > }
> > ```
> > In the above every load that originated from i has been combined into a single i128 load. This test is not about transforming into the most optimal result it is testing to make sure that a referenced load chain does not affect all of the other load chains in the basic block. The problem with the current implementation is that it is too conservative when it comes to a write operation that aliases a load. When it reaches such an instruction it tries to combine all of the load chains at that point. This reduces the amount of loads that will be combined if it combines every chain when it reaches a store that references a load.
> The line at 12 makes more sense when you visualize the load chains at that point. I will visualize the load chains in this format {base_ptr : referenced loads by instruction name }.
> So at line 12 the load chains are as follows : 
> 
> ```
> { %i : %0 } 
> { %str : %1}
> ```
> The store instruction at line 12 is storing into %i1 which aliases the base ptr %str. This causes every load chain to try and combine, but at this point each only has 1 load; so there is no combine operation performed. The problem with this is that the chain with baseptr %i wasn't referenced by that store operation so why are we trying to combine that chain at this point.
The problem isn't whether it's optimal... the problem is that the transformation doesn't preserve the semantics of the code.  For my C code, -load-combine (plus instcombine to clean up the result) gives:

```
define void @f(i32* nocapture %a, i32* nocapture readonly %b) local_unnamed_addr #0 {
entry:
  %0 = bitcast i32* %b to i128*
  %.combined = load i128, i128* %0, align 4
  %combine.extract.trunc = trunc i128 %.combined to i32
  store i32 %combine.extract.trunc, i32* %a, align 4, !tbaa !1
  %combine.extract.shift = lshr i128 %.combined, 32
  %combine.extract.trunc1 = trunc i128 %combine.extract.shift to i32
  %arrayidx3 = getelementptr inbounds i32, i32* %a, i64 1
  store i32 %combine.extract.trunc1, i32* %arrayidx3, align 4, !tbaa !1
  %combine.extract.shift2 = lshr i128 %.combined, 64
  %combine.extract.trunc3 = trunc i128 %combine.extract.shift2 to i32
  %arrayidx5 = getelementptr inbounds i32, i32* %a, i64 2
  store i32 %combine.extract.trunc3, i32* %arrayidx5, align 4, !tbaa !1
  %combine.extract.shift4 = lshr i128 %.combined, 96
  %combine.extract.trunc5 = trunc i128 %combine.extract.shift4 to i32
  %arrayidx7 = getelementptr inbounds i32, i32* %a, i64 3
  store i32 %combine.extract.trunc5, i32* %arrayidx7, align 4, !tbaa !1
  ret void
}
```

This is completely, utterly wrong.

http://reviews.llvm.org/D21496