[llvm-dev] RFC: SROA for method argument

Sun May 14 23:19:06 PDT 2017

I agree with Reid's suggestion. We can ignore the first stores of arguments
into allocas in the SROA analysis.

Orthogonal to Reid's suggestion, I implemented another approach to resolve
the problem.
This approach is more generic (not limited to arguments), simpler, and
ABI-independent.

Currently, SROA splits loads and stores only when they access the
whole alloca.
I relax this restriction so that a load/store is also split if every other
load and store to the alloca is either disjoint from it or fully contained
in it.
Whole-alloca loads and stores satisfy this new condition, so they are
still splittable.
Does this approach make sense?
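
As a rough illustration of the relaxed condition (this is not the code in
D32998), each load/store can be modelled as a byte range within the alloca.
The names `struct access`, `disjoint`, `contains`, and `is_splittable` are
hypothetical and exist only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one load/store: the byte range [begin, end)
 * that it touches inside the alloca. */
struct access { unsigned begin, end; };

/* True if the two ranges share no byte. */
static bool disjoint(struct access a, struct access b) {
    return a.end <= b.begin || b.end <= a.begin;
}

/* True if `inner` lies entirely within `outer`. */
static bool contains(struct access outer, struct access inner) {
    return outer.begin <= inner.begin && inner.end <= outer.end;
}

/* Relaxed condition: accs[cand] may be split if every other access is
 * either disjoint from it or fully contained in it. */
bool is_splittable(const struct access *accs, size_t n, size_t cand) {
    for (size_t i = 0; i < n; i++) {
        if (i == cand)
            continue;
        if (!disjoint(accs[cand], accs[i]) && !contains(accs[cand], accs[i]))
            return false;
    }
    return true;
}
```

For the x86_64 example, the accesses are roughly [0,8) and [8,16) for the
coerced i64 stores, and [8,12) and [12,16) for r.b and r.c. Under this model
the i64 store to [8,16) is splittable, while the i32 access to [8,12) is not,
because the enclosing i64 store is neither disjoint from it nor inside it.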

I updated the patch in Phabricator. https://reviews.llvm.org/D32998

I confirmed that the new approach, being ABI-independent, now works on
both x86 and ppc.
With this SROA optimization, the sample program compiles to only a few
instructions with no loop.
Without the optimization, SROA introduces an unnecessary loop-carried
dependency, and the later optimization passes cannot eliminate the loop.
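
To make "without a loop" concrete, here is a sketch of what the sample
function computes once the loop is gone. `func_closed` is a hypothetical
hand-written equivalent, not actual compiler output; the exact instruction
sequence depends on the target.

```c
struct record {
    long long a;
    int b;
    int c;
};

/* The loop from the sample program: increments r.b once per iteration. */
int func_loop(struct record r) {
    for (int i = 0; i < r.c; i++)
        r.b++;
    return r.b;
}

/* Hypothetical loop-free equivalent: the loop body runs r.c times when
 * r.c is positive and not at all otherwise. */
int func_closed(struct record r) {
    return r.c > 0 ? r.b + r.c : r.b;
}
```

The two functions agree wherever the original is well defined, since signed
overflow of r.b is undefined behavior in the loop version as well.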

-----
Hiroshi Inoue <inouehrs at jp.ibm.com>
IBM Research - Tokyo

Reid Kleckner <rnk at google.com> wrote on 2017/05/13 07:31:19:

> From: Reid Kleckner <rnk at google.com>
> To: "Friedman, Eli" <efriedma at codeaurora.org>
> Cc: Hiroshi 7 Inoue/Japan/IBM at IBMJP, llvm-dev <llvm-dev at lists.llvm.org>
> Date: 2017/05/13 07:32
> Subject: Re: [llvm-dev] RFC: SROA for method argument
>
> I'll propose a different heuristic. SROA should ignore stores of
> arguments into allocas in the entry block when deciding what slices
> to form. Such stores happen exactly once, and are usually coercions
> that we have to do for ABI reasons. SROA should generate code like
> this before promoting allocas to SSA form:
>
> define i32 @func(i64 %r.coerce.0, i64 %r.coerce.1) {
>   %r.slice.0 = alloca i64
>   %r.slice.1 = alloca i32
>   %r.slice.2 = alloca i32
>   store i64 %r.coerce.0, i64* %r.slice.0
>   %r.1.shr = lshr i64 %r.coerce.1, 32
>   %r.1 = trunc i64 %r.1.shr to i32
>   %r.2 = trunc i64 %r.coerce.1 to i32
>   store i32 %r.1, i32* %r.slice.1
>   store i32 %r.2, i32* %r.slice.2
>   ...
> }
>
> This is basically "reasoning about the CFG" without actually looking
> at loop info. Stores of arguments in the entry block can't be in a
> loop. Even if they end up in one after inlining, instcombine should
> be able to simplify the {i32,i32}->i64->{i32,i32} code.
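
The lshr/trunc pair in the IR above, and the {i32,i32}->i64->{i32,i32}
round trip that instcombine would need to simplify, can be sketched in C as
follows. The helper names `high_half`, `low_half`, and `join_halves` are
hypothetical; which half corresponds to r.b and which to r.c depends on the
target's ABI and endianness.

```c
#include <stdint.h>

/* Counterpart of "lshr i64 %v, 32" followed by "trunc ... to i32". */
uint32_t high_half(uint64_t v) {
    return (uint32_t)(v >> 32);
}

/* Counterpart of a plain "trunc i64 %v to i32". */
uint32_t low_half(uint64_t v) {
    return (uint32_t)v;
}

/* The inverse direction: zero-extend both halves, shift, and or them
 * together into one 64-bit value. */
uint64_t join_halves(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}
```

Splitting a value and joining the halves again yields the original value,
which is why the round trip is a candidate for simplification.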
>
> On Tue, May 9, 2017 at 10:53 AM, Friedman, Eli via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> On 5/9/2017 6:05 AM, Hiroshi 7 Inoue via llvm-dev wrote:
> Hi,
>
> I am working on improving SROA to generate better code when a method
> has a struct among its arguments. I would appreciate any suggestions
> or comments on how best to proceed with this optimization.
>
> * Problem *
> I observed that LLVM often generates redundant instructions around
> libstdc++’s istreambuf_iterator. The problem comes from the scalar
> replacement (SROA) for methods with an aggregate as an argument.
> Here is a simplified example in C.
>
> struct record {
>   long long a;
>   int b;
>   int c;
> };
>
> int func(struct record r) {
>   for (int i = 0; i < r.c; i++)
>     r.b++;
>   return r.b;
> }
>
> When updating r.b (or r.c as well), SROA generates redundant
> instructions on some platforms (such as x86_64 and ppc64); here, r.b
> and r.c are packed into one 64-bit GPR when the struct is passed as
> a method argument. The problem arises when the same memory
> location is accessed by load/store instructions of different types.
> For this example, Clang generates the following IR to initialize the
> struct for ppc64 and x86_64. On both platforms, the 64-bit value is
> first stored into memory allocated by alloca. Later, the same memory
> location is accessed as 32-bit integer values (r.b and r.c).
>
> for ppc64
> %struct.record = type { i64, i32, i32 }
>
> define signext i32 @ppc64le_func([2 x i64] %r.coerce) #0 {
> entry:
> %r = alloca %struct.record, align 8
> %0 = bitcast %struct.record* %r to [2 x i64]*
> store [2 x i64] %r.coerce, [2 x i64]* %0, align 8
> ....
>
> for x86_64
> define i32 @x86_64_func(i64 %r.coerce0, i64 %r.coerce1) #0 {
> entry:
> %r = alloca %struct.record, align 8
> %0 = bitcast %struct.record* %r to { i64, i64 }*
> %1 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %0, i32 0, i32 0
> store i64 %r.coerce0, i64* %1, align 8
> %2 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %0, i32 0, i32 1
> store i64 %r.coerce1, i64* %2, align 8
> ....
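
In C terms, the entry blocks above behave roughly like the sketch below: the
struct arrives as two 64-bit values, they are stored into the stack slot, and
the fields are then read back as 32-bit integers. The functions `coerce_out`
and `coerce_in` are hypothetical stand-ins for what the caller and callee do
under the ABI; the sketch assumes a 64-bit `long long`, a 32-bit `int`, and
no padding in the struct.

```c
#include <stdint.h>
#include <string.h>

struct record {
    long long a;
    int b;
    int c;
};

/* Assumption of this sketch: the struct is exactly two 64-bit words. */
_Static_assert(sizeof(struct record) == 2 * sizeof(uint64_t),
               "struct record is expected to occupy 16 bytes");

/* Caller side (hypothetical): view the struct as two 64-bit values. */
void coerce_out(const struct record *r, uint64_t out[2]) {
    memcpy(out, r, sizeof *r);
}

/* Callee entry block (hypothetical): store the two 64-bit values into
 * the stack slot; b and c are later read from it as 32-bit fields. */
struct record coerce_in(uint64_t c0, uint64_t c1) {
    struct record r;
    uint64_t words[2] = { c0, c1 };
    memcpy(&r, words, sizeof r);
    return r;
}
```

The second 64-bit word carries both r.b and r.c, so the 64-bit store and the
later 32-bit accesses touch the same bytes with different types.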
>
> For such a code sequence, the current SROA generates instructions to
> update only the upper (or lower) half of the 64-bit value when storing
> r.b (or r.c). SROA can split an i64 value into two i32 values under
> some conditions (e.g. when the struct contains only int b and int c
> in this example), but it is not capable of splitting complex cases.
>
> When there are accesses of mixed type to an alloca, SROA just treats
> the whole alloca as a big integer, and generates PHI nodes
> appropriately.  In many cases, instcombine would then slice up the
> generated PHI nodes to use more appropriate types, but that doesn't
> work out here.  (See InstCombiner::SliceUpIllegalIntegerPHI.)
> Probably the right solution is to make instcombine more aggressive
> here; it's hard to come up with a generally useful transform in SROA
> without reasoning about control flow.
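
A sketch of the "big integer" treatment described above, written in C rather
than IR: each update of the 32-bit field is a truncate, add, mask, and or on
the full 64-bit value, and that 64-bit value is carried around the loop. The
functions `bump_low_half` and `run_loop` are hypothetical, and the sketch
assumes that r.b sits in the low half and r.c in the high half, which depends
on the target.

```c
#include <stdint.h>

/* Increment the 32-bit field held in the low half of a 64-bit value,
 * leaving the high half untouched. */
uint64_t bump_low_half(uint64_t packed) {
    uint32_t b = (uint32_t)packed;                 /* extract the field */
    b += 1;                                        /* update it */
    return (packed & 0xFFFFFFFF00000000ULL) | b;   /* mask and re-insert */
}

/* The loop from the example with both fields kept in one 64-bit value:
 * `packed` is the loop-carried value, even though only its low half
 * changes between iterations. */
uint64_t run_loop(uint64_t packed) {
    int32_t c = (int32_t)(packed >> 32);           /* loop bound, fixed */
    for (int32_t i = 0; i < c; i++)
        packed = bump_low_half(packed);
    return packed;
}
```

Note that this sketch uses unsigned arithmetic for the field, so it wraps on
overflow, whereas the original `int` field would have undefined behavior.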
> -Eli
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project