[PATCH] D59129: [SROA] WIP: Lowering alloca is not always beneficial

Mon Mar 11 09:04:33 PDT 2019

SjoerdMeijer added a comment.

Hi @RKSimon, @efriedma, thank you very much for looking at this!

I will address the minor issues, but first about:

> How hard would it be to try to handle your testcase late instead?

I think I've explored most places where intrinsics/alignment are handled: from Clang, InstCombine, to the LoadStoreOptimizer, to see where I could e.g. patch up the alignment problem. But either this analysis and rewrite was difficult to fit in, but perhaps more importantly,  I came to the conclusion that here in SROA is where the rewrites happen that's causing issues later on. Thus, I think it makes to make a different decision here. But perhaps I've missed something, so that was one reason to post this work-in-progress patch. And you're absolutely right about this:

> If we are going to do this, we probably don't want to disable SROA completely for allocas that contain a memcpy source/destination, just avoid promoting the specific slices where it's relevant.

We definitely don't want to disable SROA completely. If you look at `shouldExpand`, you see that a decision is made per alloca slice. So, it's not a blunt way to completely disable things, or at least, that's definitely not the idea.

To make an informed decision that memcpy's are expensive, we need 3 things:

1. get the cost of a c-library intrinsic, which is what D59014 <https://reviews.llvm.org/D59014> is enabling,
2. model the cost of a memcpy, which is what's currently missing and still need to add, and then
3. This SROA patch which just uses 1) and 2). Thus, I don't think at this point that this patch needs much changing or more code, except of course some tests.

To illustrate that "patching up the alignment" later on is quite difficult, this is the IR that comes out of SROA:

  define hidden void @foo(i8* nocapture readonly %src) local_unnamed_addr #0 {
  entry:
    %tmp.sroa.0.sroa.0 = alloca [8 x i8]
    %tmp.sroa.0.sroa.5 = alloca [8 x i8]
    %tmp.sroa.0.sroa.0.0..sroa_idx17 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
    call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx17)
    %tmp.sroa.0.sroa.5.0..sroa_idx15 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
    call void @llvm.lifetime.start.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx15)
    %tmp.sroa.0.sroa.0.0..sroa_idx9 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
    call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.0.0..sroa_idx9, i8* align 1 %src, i32 8, i1 false)
    %tmp.sroa.0.sroa.3.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 8
    %tmp.sroa.0.sroa.3.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.3.0.src.sroa_idx to i64*
    %tmp.sroa.0.sroa.3.0.copyload = load i64, i64* %tmp.sroa.0.sroa.3.0.src.sroa_cast, align 1
    %tmp.sroa.0.sroa.4.0.src.sroa_idx = getelementptr inbounds i8, i8* %src, i32 16
    %tmp.sroa.0.sroa.4.0.src.sroa_cast = bitcast i8* %tmp.sroa.0.sroa.4.0.src.sroa_idx to i64*
    %tmp.sroa.0.sroa.4.0.copyload = load i64, i64* %tmp.sroa.0.sroa.4.0.src.sroa_cast, align 1
    %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx = getelementptr inbounds i8, i8* %src, i32 24
    %tmp.sroa.0.sroa.5.0..sroa_idx13 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
    call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %tmp.sroa.0.sroa.5.0..sroa_idx13, i8* align 1 %tmp.sroa.0.sroa.5.0.src.sroa_raw_idx, i32 8, i1 false)
    %call = call %struct.data_2_t* bitcast (%struct.data_2_t* (...)* @getData2 to %struct.data_2_t* ()*)() #3
    %0 = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 0
    store i64 %tmp.sroa.0.sroa.3.0.copyload, i64* %0, align 8
    %tmp.sroa.0.16..sroa_idx = getelementptr inbounds %struct.data_2_t, %struct.data_2_t* %call, i32 0, i32 1
    store i64 %tmp.sroa.0.sroa.4.0.copyload, i64* %tmp.sroa.0.16..sroa_idx, align 8
    %tmp.sroa.0.sroa.0.0..sroa_idx18 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.0, i32 0, i32 0
    call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.0.0..sroa_idx18)
    %tmp.sroa.0.sroa.5.0..sroa_idx16 = getelementptr inbounds [8 x i8], [8 x i8]* %tmp.sroa.0.sroa.5, i32 0, i32 0
    call void @llvm.lifetime.end.p0i8(i64 8, i8* %tmp.sroa.0.sroa.5.0..sroa_idx16)
    ret void
  }

The main problem here is the `align 1` on the i64 loads, and again, that's created here and thus preventing that from happening here in SROA seems to make most sense to me.

There's actually another problem with this IR, which is that all these copyloads are emitted first by SROA, consecutively, creating huge live ranges between them and the store instructions. Thus creating another problem (when the example is a bit bigger): spilling. A more sensible output would be to follow more the original program-order. I think that's what SROA tries to do, but fails quite bad in these examples. But anyway, this is not the biggest problem now, and it is hidden when we not always lower allocas.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59129/new/

https://reviews.llvm.org/D59129