[PATCH] D41330: [X86] Reduce Store Forward Block issues in HW

Wed Dec 20 07:08:26 PST 2017

lsaba added inline comments.

================
Comment at: lib/Target/X86/X86FixupSFB.cpp:32
+// to memory, but this requires we know all stores that alias
+// the loaded location in memory, AA is too conservative.
+
----------------
hfinkel wrote:
> > AA is too conservative.
> 
> Did you try turning it on? :-)
> 
> If you overload useAA() in X86Subtarget, then you'll actually get non-trivial AA in the backend. Given all of the recent work on adding scheduling models, we should probably experiment with this again.
> 
> That having been said, it relies on having good MMOs. If we don't, then you'll have trouble using it for this purpose until that's fixed.
Perhaps it's better to say that the solution approach suggested in the comment is more conservative, and not the AA, I will try to explain what I meant with an example:
  %struct.S = type { i32, i32, i32, i32 }
  ; Function Attrs: nounwind uwtable 
  define void @test_conditional_block(%struct.S* nocapture %s1, 
  %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* 
  nocapture readonly %s4) local_unnamed_addr #0 {
  entry:
    %b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
    store i32 %x, i32* %b, align 4
    %0 = bitcast %struct.S* %s3 to i8*
    %1 = bitcast %struct.S* %s4 to i8*
    tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
    %2 = bitcast %struct.S* %s2 to i8*
    %3 = bitcast %struct.S* %s1 to i8*
    tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
    ret void
  }

the currently generated code: 

  movl %edx, 4(%rdi) // blocking store
  movups (%r8), %xmm0
  movups %xmm0, (%rcx)
  movups (%rdi), %xmm0 // blocked load
  movups %xmm0, (%rsi)

If we were to solve the problem by loading the old value, then storing like the comment suggests:
    movups (%rdi), %xmm0 // blocked load
    vinsert %edx, xmm0, 1 //store b to xmm0 at index 1 
    movups %xmm0, (%rsi)

We would have to prove that the previous copy from s3 to s4:
    movups (%r8), %xmm0
    movups %xmm0, (%rcx)
does not alias or must alias (and in that case do a similar handling for it) the copy from s1 to s2, while the AA would naturally assume they may alias in this case.
On the other hand, the proposed solution in this patch (breaking the memcpy) will be correct regardless to the memory relation between s3, s4 and s1,s2.

I am not too familiar with the capabilities of the backend AA, I did not try turning it on :) correct me if i'm wrong, but relying on it and using the approach in the comment will yield a more conservative optimization than the one implemented here, at least for the trivial cases.

In general, the solution in the comment gets more complicated if the blocking store is in a predecessing block, in that case we'll need to prove safety in all predecessing blocks, or hoist the memcpy for conditional blocks

================
Comment at: lib/Target/X86/X86FixupSFB.cpp:224
+// TODO: Not sure this is the best way to get these sizes
+// but relying on MachineMemOperand does not always work.
+#define MOV128SZ 16
----------------
hfinkel wrote:
> Can you elaborate on what you're seeing? Missing MMOs are something we should fix. If the MMO size is too small, that's a bug (i.e., can cause miscompiles from invalid instruction scheduling). MMO sizes that are too big should also be fixed.
You are right. I saw missing MMOs, I will try to fix the ones I encountered, but it doesn't guarantee there won't be more in code I did not reach so i'm not sure it would be safe to remove those defines

https://reviews.llvm.org/D41330