[PATCH][SROA] Also slice the STORE when slicing a LOAD in AllocaSliceRewriter

Mon Aug 25 19:50:32 PDT 2014

Hi Chandler,

Sorry that I didn't provide more details.
Take the specific case @load_store_i64 in the patch as an example. There is an i64 alloca, which is written by two i32 STOREs and then read by an i64 LOAD. Currently, SROA splits the i64 LOAD so that the two i32 STOREs and the i64 LOAD can be optimized away.
The problem is that splitting the i64 LOAD introduces additional ZEXT/SHL/AND/OR IRs to feed the following i64 uses. These additional IRs combine the two i32 values into one i64, which is then stored. So I thought: why don't we store the two i32 values separately, so that these additional IRs can be removed?
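To make the difference concrete, here is a minimal C++ sketch (not part of the patch) of the two store shapes. The struct PointTwoI32 and the functions storeMerged/storeSplit are hypothetical names used only for illustration, and the equivalence assumes a little-endian target, where the low 32 bits of the i64 land at offset 0.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for %struct.point.two.i32 from the test case.
struct PointTwoI32 {
  std::uint32_t x;
  std::uint32_t y;
};

// Shape of the code without the patch: widen both halves, shift the high
// half, OR them together (zext/shl/or), then do a single 8-byte store.
void storeMerged(PointTwoI32 *ptr, std::uint32_t lo, std::uint32_t hi) {
  std::uint64_t merged = (static_cast<std::uint64_t>(hi) << 32) | lo;
  std::memcpy(ptr, &merged, sizeof merged);
}

// Shape of the code with the patch: two independent 4-byte stores,
// with no merging arithmetic at all.
void storeSplit(PointTwoI32 *ptr, std::uint32_t lo, std::uint32_t hi) {
  ptr->x = lo;
  ptr->y = hi;
}
```

On a little-endian target both functions leave the same bytes in memory, but the second needs no merging arithmetic.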
I added the patch to SROA for the following 3 reasons:
  1) It is SROA that introduces such additional IRs.
  2) It is also easier to do this optimization in SROA. If the LOAD can be sliced, the STORE can be sliced as well, so we can just split the STORE. If we keep the additional IRs and the i64 STORE, later optimization passes or the backend need more effort to analyze both the additional IRs and the STORE. Currently we don't have any similar optimization.
  3) This patch does the same thing SROA already does when it handles a memory copy.
      E.g. if the i64 LOAD and STORE IRs in the test case:
              "   %1 = bitcast %struct.point.two.i32* %ptr to i64*
                  %2 = load i64* %ref.tmp, align 8
                  store i64 %2, i64* %1, align 4 "
      were instead a memory copy like the following:
              "     %1 = bitcast %struct.pointt* %0 to i8*
                    %2 = bitcast i64* %ref.tmp to i8*
                    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %1, i8* %2, i64 8, i32 4, i1 false) "
      The memory copy would also be sliced into two i32 LOADs and two i32 STOREs.
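As a rough illustration of reason 3, the C++ sketch below (not LLVM code) contrasts one 8-byte copy with the sliced form of two 4-byte loads and two 4-byte stores. The function names copyWhole and copySliced are hypothetical, and the 4-byte slice size is only assumed to match the two i32 fields of the test case.

```cpp
#include <cstdint>
#include <cstring>

// One 8-byte copy, analogous to the llvm.memcpy call with length 8.
void copyWhole(void *dst, const void *src) {
  std::memcpy(dst, src, 8);
}

// The sliced form: two 4-byte loads followed by two 4-byte stores,
// at offsets 0 and 4.
void copySliced(void *dst, const void *src) {
  auto *d = static_cast<unsigned char *>(dst);
  auto *s = static_cast<const unsigned char *>(src);
  std::uint32_t lo, hi;
  std::memcpy(&lo, s, 4);      // load slice at offset 0
  std::memcpy(&hi, s + 4, 4);  // load slice at offset 4
  std::memcpy(d, &lo, 4);      // store slice at offset 0
  std::memcpy(d + 4, &hi, 4);  // store slice at offset 4
}
```

Both functions write the same 8 bytes; the sliced form just exposes each half as its own value.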

For @load_store_i64, running "opt" gives the following code:
opt -S -O3 < input.ll
without the patch:
       %1 = bitcast %struct.point.two.i32* %ptr to i64*
       %ref.tmp.sroa.2.0.insert.ext = zext i32 %a to i64
       %ref.tmp.sroa.2.0.insert.shift = shl nuw i64 %ref.tmp.sroa.2.0.insert.ext, 32
       %ref.tmp.sroa.0.0.insert.insert = or i64 %ref.tmp.sroa.2.0.insert.shift, %ref.tmp.sroa.2.0.insert.ext
       store i64 %ref.tmp.sroa.0.0.insert.insert, i64* %1, align 4
with the patch:
         %ref.tmp.sroa.0.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 0
         %ref.tmp.sroa.2.0..sroa_idx = getelementptr inbounds %struct.point.two.i32* %ptr, i64 0, i32 1
         store i32 %a, i32* %ref.tmp.sroa.0.0..sroa_idx, align 4
         store i32 %a, i32* %ref.tmp.sroa.2.0..sroa_idx, align 4

The second version is simpler than the first. If we also compile both with "llc", we get the following results for AArch64 and X86:
llc -march=aarch64 < input.ll
1st version in AArch64:
         ubfx	x8, x0, #0, #32
         bfi	x8, x8, #32, #32
         str	 x8, [x1]
2nd version in AArch64:
         stp	 w0, w0, [x1]

llc < input.ll
1st version in X86:
         movl	%edi, %eax
         movq	%rax, %rcx
         shlq	$32, %rcx
         orq	%rax, %rcx
         movq	%rcx, (%rsi)
2nd version in X86:
         movl	%edi, (%rsi)
         movl	%edi, 4(%rsi)

Thanks,
-Hao

http://reviews.llvm.org/D4954