[llvm-dev] [RFC] Aggregate load/store, proposed plan

deadal nix via llvm-dev llvm-dev at lists.llvm.org
Sun Aug 23 00:50:23 PDT 2015


I played a bit with the alloca based approach. Here is a simple version of
it:

http://reviews.llvm.org/D12269

Overall, it seems to play nicely with the rest of LLVM.

2015-08-21 15:27 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:

> ----- Original Message -----
> > From: "Philip Reames" <listmail at philipreames.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>, "deadal nix" <deadalnix at gmail.com>
> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> > Sent: Friday, August 21, 2015 5:20:34 PM
> > Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> >
> >
> >
> > On 08/21/2015 02:46 PM, Hal Finkel via llvm-dev wrote:
> > > ----- Original Message -----
> > >> From: "deadal nix" <deadalnix at gmail.com>
> > >> To: "Hal Finkel" <hfinkel at anl.gov>
> > >> Cc: "Mehdi Amini" <mehdi.amini at apple.com>, "llvm-dev"
> > >> <llvm-dev at lists.llvm.org>
> > >> Sent: Friday, August 21, 2015 1:24:04 AM
> > >> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> > >>
> > >> I've done some investigation. The proposed approach seems to work
> > >> well on structs with no padding, but the optimizer gets confused
> > >> when the struct has some. That's a bummer, but it already improves
> > >> things, so I'd rather have that than nothing.
> > >>
> > >> Alternatively, I have an approach, in theory only so far, that may
> > >> be better. The idea is to create an alloca, memcpy the aggregate
> > >> into it, transform extractvalues into loads, and transform
> > >> insertvalues into an optional memcpy into a new alloca (if the old
> > >> value is still live after the insertvalue) plus a store. There is
> > >> some extra trickery one needs to do for call arguments, ret, and
> > >> phi nodes, but overall, I think this can work. This approach would
> > >> rely on SROA to generate something sensible. As it is going to do
> > >> massive substitutions, I'm not sure InstCombine would be the right
> > >> place to do it. Maybe its own pass?
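[Editor's note: as a rough illustration of the alloca-based idea above (a sketch, not IR from the thread; names and sizes are hypothetical), an extractvalue from an aggregate in memory could be rewritten into a stack copy plus an ordinary field load, which SROA is then expected to clean up:]

```llvm
; Hypothetical rewrite for %B = type { %B__vtbl*, i32 } (16 bytes with
; tail padding on x86-64): copy the aggregate into a fresh stack slot,
; then turn the extractvalue of field 1 into a plain i32 load.
%tmp = alloca %B, align 8
%src = bitcast %B* %p to i8*
%dst = bitcast %B* %tmp to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 16, i32 8, i1 false)
%fieldp = getelementptr inbounds %B, %B* %tmp, i32 0, i32 1
%val = load i32, i32* %fieldp, align 8
```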
> > >>
> > > I think this might make more sense; it sounds closer to the
> > > transformation that Clang does for C-level aggregates, and should
> > > play nicer with SROA, etc. I'm not entirely sure where it should
> > > go, but feel free to prototype it as its own pass, and if it works
> > > well, we'll worry about pipeline placement.
> > >
> > >> That wouldn't work for atomic/volatile, but I think that would
> > >> work
> > >> nicely for regular load/stores.
> > >>
> > > We only support atomics on integer pointee types anyway, so atomics
> > > are not relevant.
> > OT: I plan to change this at some point.  In particular, I need
> > atomic
> > float and double access.  Doing the bitcasts in the frontend works,
> > but
> > given there's an obvious lowering, I don't see why we shouldn't do
> > this
> > within LLVM.
>
> I agree; I'd like them for float/double as well.
>
>  -Hal
>
> > >
> > >   -Hal
> > >
> > >>
> > >> How does that sound ?
> > >>
> > >>
> > >>
> > >>
> > >> 2015-08-20 16:38 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> > >>
> > >>
> > >> ----- Original Message -----
> > >>> From: "deadal nix" < deadalnix at gmail.com >
> > >>> To: "Hal Finkel" < hfinkel at anl.gov >
> > >>> Cc: "Mehdi Amini" < mehdi.amini at apple.com >, "llvm-dev" <
> > >>> llvm-dev at lists.llvm.org >
> > >>> Sent: Thursday, August 20, 2015 4:09:17 PM
> > >>> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> > >>>
> > >>
> > >>> Problem:
> > >>>
> > >>> Many languages define aggregates and some way to manipulate them.
> > >>> LLVM defines aggregate types (arrays and structs) to handle them.
> > >>> However, when aggregates are loaded or stored, LLVM will simply
> > >>> ignore these up to legalization in the backend. This leads to
> > >>> many misoptimizations. Most frontends use a set of tricks to
> > >>> work around this limitation, but that is an undesirable situation
> > >>> as it increases the work required to write a frontend. Ideally
> > >>> that work should be done once by LLVM instead of every time by
> > >>> each frontend.
> > >>>
> > >>> In previous discussions on the subject, many LLVM users have
> > >>> expressed interest in being able to use aggregate memory
> > >>> accesses. In addition, it is likely that it would have reduced
> > >>> the workload of some existing frontends.
> > >>>
> > >>> The proposed solution is to transform aggregate loads and stores
> > >>> into something that the LLVM toolchain already understands and is
> > >>> able to work with.
> > >>>
> > >>> The proposed solution will use InstCombine to do the
> > >>> transformation, as it is done early and will allow subsequent
> > >>> passes to work with something familiar (basically,
> > >>> canonicalization).
> > >>>
> > >>>
> > >>> Proposed solution:
> > >>>
> > >>>
> > >>> Aggregate loads and stores are turned into loads and stores of a
> > >>> scalar of the same size and alignment. Binary manipulations, like
> > >>> masks and shifts, are used to build the aggregate from the scalar
> > >>> after loading, and to convert the aggregate to a scalar when
> > >>> storing.
> > >>>
> > >>>
> > >>> For instance, the following IR (extracted from a D frontend) :
> > >>>
> > >>> %B__vtbl = type { i8*, i32 (%B*)* }
> > >>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> > >>>
> > >>> %0 = tail call i8* @allocmemory(i64 32)
> > >>>
> > >>> %1 = bitcast i8* %0 to %B*
> > >>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> > >>>
> > >>>
> > >>> Would be canonicalized into:
> > >>> %0 = tail call i8* @allocmemory(i64 32)
> > >>> %1 = bitcast i8* %0 to i128*
> > >>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to
> > >>> i64)
> > >>> to i128), i128 774763251095801167872), i128* %1, align 8
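[Editor's note: a small Python sketch, not part of the original thread, showing where the i128 constant comes from; it assumes the little-endian x86-64 layout of %B = { %B__vtbl*, i32 }, with the pointer field at bit offset 0 and the i32 field at byte offset 8, i.e. bit offset 64:]

```python
# Rebuild the i128 scalar stored for %B { @B__vtblZ, i32 42 } by
# shift-and-or of the two fields (little-endian layout assumed).
PTR_BITS = 64  # pointer field occupies the low 64 bits

def pack_B(vtbl_ptr: int, a: int) -> int:
    """Pack the two fields of %B into one i128-style integer."""
    return (vtbl_ptr & (2**PTR_BITS - 1)) | (a << 64)

# With the pointer value left symbolic (0 here), the i32 field alone
# accounts for the big constant in the canonicalized store:
print(pack_B(0, 42))  # 774763251095801167872, i.e. 42 << 64
```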
> > >>>
> > >> As I've said before, we really should introduce some useful
> > >> canonicalization for these, and so I support the investigations in
> > >> this area.
> > >>
> > >> Have you verified that InstCombine can undo this transformation
> > >> reasonably? Specifically, if you have a bunch of insertvalues, and
> > >> an aggregate store in one function, and a load in another function
> > >> with a bunch of extractvalues, which gets inlined into the
> > >> function
> > >> with the store, does everything get combined appropriately to
> > >> forward the values directly regardless of how many of the members
> > >> are extracted and in what order?
> > >>
> > >> -Hal
> > >>
> > >>
> > >>
> > >>>
> > >>> Which the rest of the LLVM pipeline can work with.
> > >>>
> > >>>
> > >>>
> > >>> Limitations:
> > >>>
> > >>>
> > >>> 1/ This solution will not handle large (tens of kilobytes)
> > >>> aggregates properly. It is an accepted limitation, both for this
> > >>> proposal and for other parts of the pipeline that handle
> > >>> aggregates. Optionally, checks can be added both for this
> > >>> canonicalization and for SROA to disable them on very large
> > >>> aggregates, so as to avoid wasting work that won't yield good
> > >>> codegen in the end anyway. This limitation should not be a
> > >>> blocker, as most aggregates are fairly small. For instance, some
> > >>> languages make heavy use of fat pointers, and would greatly
> > >>> benefit from this canonicalization.
> > >>>
> > >>>
> > >>> 2/ This solution will generate loads and stores of values that
> > >>> may not be natively supported by the hardware. The hardware does
> > >>> not natively support aggregates to begin with, so both the
> > >>> original IR and the canonicalized IR will require optimization.
> > >>> This is not ideal, but the canonicalization is a plus for two
> > >>> reasons:
> > >>> - A subset of these memory accesses won't need canonicalization
> > >>> anymore.
> > >>>
> > >>> - Other passes in LLVM will be able to work with these loads and
> > >>> perform adequate transformations.
> > >>>
> > >>>
> > >>>
> > >>> Possible alternatives:
> > >>>
> > >>>
> > >>> In order to mitigate 1/, it is possible to gate the
> > >>> canonicalization to aggregates under a certain size. This
> > >>> essentially avoids doing work that will lead to bad codegen no
> > >>> matter what.
> > >>>
> > >>> In order to mitigate 2/, it is possible to slice aggregate loads
> > >>> and stores according to the target's data layout. This CANNOT be
> > >>> implemented for atomic/volatile as it would change semantics, but
> > >>> it can be done for regular ones, which are the most common.
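[Editor's note: a possible shape of that slicing, sketched in Python with hypothetical names (not code from the thread): given per-field (offset, size) entries from the data layout, one wide access is split into one natively sized access per field, and padding bytes are never touched:]

```python
# Hypothetical sketch: split one aggregate store into per-field scalar
# stores using (name, byte_offset, byte_size) entries taken from the
# target's data layout. Padding has no entry, so no slice covers it.
def slice_store(fields):
    return [f"store i{8 * size} %{name}, byte offset {off}"
            for name, off, size in fields]

# %B = { %B__vtbl* at offset 0 (8 bytes), i32 at offset 8 (4 bytes) },
# followed by 4 bytes of tail padding that the slices simply skip.
for op in slice_store([("vtbl", 0, 8), ("a", 8, 4)]):
    print(op)
```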
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Does that look better as an RFC?
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> 2015-08-19 22:11 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> > >>>
> > >>>
> > >>> ----- Original Message -----
> > >>>> From: "Mehdi Amini via llvm-dev" < llvm-dev at lists.llvm.org >
> > >>>> To: "deadal nix" < deadalnix at gmail.com >
> > >>>> Cc: "llvm-dev" < llvm-dev at lists.llvm.org >
> > >>>> Sent: Wednesday, August 19, 2015 7:24:28 PM
> > >>>> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> To be sure, because the RFC below is not detailed and assumes
> > >>>> everyone
> > >>>> knows about all the emails from 10 months ago,
> > >>> I agree. The RFC needs to summarize the problems and the
> > >>> potential
> > >>> solutions.
> > >>>
> > >>>> is there more to do
> > >>>> than what is proposed in http://reviews.llvm.org/D9766 ?
> > >>>>
> > >>>> So basically the proposal is that *InstCombine*
> > >>> I think that fixing this early in the optimizer makes sense
> > >>> (InstCombine, etc.). This seems little different from any other
> > >>> canonicalization problem. These direct aggregate IR values are
> > >>> valid
> > >>> IR, but not our preferred canonical form, so we should transform
> > >>> the
> > >>> IR, when possible, into our preferred canonical form.
> > >>>
> > >>> -Hal
> > >>>
> > >>>
> > >>>
> > >>>> turns aggregate
> > >>>> load/store into a load/store using an integer of equivalent size
> > >>>> and
> > >>>> insert the correct bitcast before/after, right?
> > >>>>
> > >>>> Example is:
> > >>>>
> > >>>> %0 = tail call i8* @allocmemory(i64 32)
> > >>>> %1 = bitcast i8* %0 to %B*
> > >>>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> > >>>>
> > >>>> into:
> > >>>>
> > >>>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to
> > >>>> i64)
> > >>>> to i128), i128 774763251095801167872), i128* %1, align 8
> > >>>>
> > >>>> Where the aggregate is:
> > >>>>
> > >>>> %B__vtbl = type { i8*, i32 (%B*)* }
> > >>>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> > >>>>
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> —
> > >>>> Mehdi
> > >>>>
> > >>>>
> > >>>>> On Aug 19, 2015, at 5:02 PM, deadal nix via llvm-dev
> > >>>>> < llvm-dev at lists.llvm.org > wrote:
> > >>>>>
> > >>>>> It is pretty clear people need this. Let's get this moving.
> > >>>>>
> > >>>>> I'll try to sum up the points that have been made and I'll try
> > >>>>> to address them carefully.
> > >>>>>
> > >>>>> 1/ There is no good solution for large aggregates.
> > >>>>> That is true. However, I don't think this is a reason not to
> > >>>>> address smaller aggregates, as they appear to be needed.
> > >>>>> Realistically, the proportion of aggregates that are very large
> > >>>>> is small, and there is no expectation that such a thing would
> > >>>>> map nicely to the hardware anyway (the hardware won't have
> > >>>>> enough registers to load it all). I think it is reasonable to
> > >>>>> expect good handling of relatively small aggregates like fat
> > >>>>> pointers while accepting that large ones will be inefficient.
> > >>>>>
> > >>>>> This limitation is not unique to the current discussion, as
> > >>>>> SROA suffers from the same limitation.
> > >>>>> It is possible to disable the transformation for aggregates
> > >>>>> that are too large if this is too big of a concern. It should
> > >>>>> maybe also be done for SROA.
> > >>>>>
> > >>>>> 2/ Slicing the aggregate breaks the semantics of
> > >>>>> atomic/volatile.
> > >>>>> That is true. It means slicing the aggregate should not be done
> > >>>>> for atomic/volatile. It doesn't mean this should not be done
> > >>>>> for regular ones, as it is reasonable to handle atomic/volatile
> > >>>>> differently. After all, they have different semantics.
> > >>>>>
> > >>>>> 3/ Not slicing can create scalars that aren't supported by the
> > >>>>> target. This is undesirable.
> > >>>>> Indeed. But as always, the important question is: compared to
> > >>>>> what?
> > >>>>>
> > >>>>> The hardware has no notion of aggregates, so an aggregate and a
> > >>>>> large scalar both end up requiring legalization. Doing the
> > >>>>> transformation is still beneficial:
> > >>>>> - Some aggregates will generate valid scalars. For such
> > >>>>> aggregates, this is a 100% win.
> > >>>>> - For aggregates that won't, the situation is still better, as
> > >>>>> various optimization passes will be able to handle the load in
> > >>>>> a sensible manner.
> > >>>>> - The transformation never makes the situation worse than it is
> > >>>>> to begin with.
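[Editor's note: for illustration (a sketch, not from the thread; little-endian bit offsets assumed), once the load is a single scalar, each extractvalue folds to a shift plus a mask, operations the existing scalar passes already understand:]

```python
def extract_field(scalar: int, bit_offset: int, bit_width: int) -> int:
    """Model of extractvalue on the canonicalized scalar form:
    shift the field down to bit 0, then mask to its width."""
    return (scalar >> bit_offset) & ((1 << bit_width) - 1)

# The i32 field of %B lives at bit offset 64 in the i128 scalar,
# so extracting it recovers the stored 42.
packed = 42 << 64
print(extract_field(packed, 64, 32))  # 42
```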
> > >>>>>
> > >>>>> In previous discussions, Hal Finkel seemed to think that the
> > >>>>> scalar solution is preferable to the slicing one.
> > >>>>>
> > >>>>> Is that a fair assessment of the situation? Considering all of
> > >>>>> this, I think the right path forward is:
> > >>>>> - Go for the scalar solution in the general case.
> > >>>>> - If that is a problem, the slicing approach can be used for
> > >>>>> non-atomic/volatile accesses.
> > >>>>> - If necessary, disable the transformation for very large
> > >>>>> aggregates (and consider doing so for SROA as well).
> > >>>>>
> > >>>>> Do we have a plan ?
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> LLVM Developers mailing list
> > >>>>> llvm-dev at lists.llvm.org
> > >>>>>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >>>> _______________________________________________
> > >>>> LLVM Developers mailing list
> > >>>> llvm-dev at lists.llvm.org
> > >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >>>>
> > >>> --
> > >>> Hal Finkel
> > >>> Assistant Computational Scientist
> > >>> Leadership Computing Facility
> > >>> Argonne National Laboratory
> > >>>
> > >>>
> > >> --
> > >> Hal Finkel
> > >> Assistant Computational Scientist
> > >> Leadership Computing Facility
> > >> Argonne National Laboratory
> > >>
> > >>
> >
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>

