[llvm-dev] [RFC] Aggregate load/store, proposed plan

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 21 15:27:05 PDT 2015


----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "Hal Finkel" <hfinkel at anl.gov>, "deadal nix" <deadalnix at gmail.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Friday, August 21, 2015 5:20:34 PM
> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> 
> 
> 
> On 08/21/2015 02:46 PM, Hal Finkel via llvm-dev wrote:
> > ----- Original Message -----
> >> From: "deadal nix" <deadalnix at gmail.com>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "Mehdi Amini" <mehdi.amini at apple.com>, "llvm-dev"
> >> <llvm-dev at lists.llvm.org>
> >> Sent: Friday, August 21, 2015 1:24:04 AM
> >> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> >>
> >> I've done some investigation. The proposed approach seems to work
> >> well on structs with no padding, but the optimizer gets confused
> >> when the struct has some. That's a bummer, but it already improves
> >> things, so I'd rather have that than nothing.
> >>
> >> Alternatively, I think I have an approach, so far in theory only,
> >> that may be better. The idea is to create an alloca, memcpy the
> >> aggregate into it, transform extractvalues into loads, and
> >> transform insertvalues into an optional memcpy into a new alloca
> >> (if the old value is still live after the insertvalue) plus a
> >> store. There is some extra trickery one needs to do for call
> >> arguments, ret and phi nodes, but overall, I think this can work.
> >> This approach would rely on SROA to generate something sensible.
> >> As it is going to do massive substitutions, I'm not sure
> >> InstCombine would be the right place to do it. Maybe its own
> >> pass?
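> >>
> >> To illustrate with the %B type from earlier in the thread, a load
> >> feeding an extractvalue, something like (a sketch; %tmp and the
> >> size/alignment values are made up):
> >>
> >> %val = load %B, %B* %p, align 8
> >> %x = extractvalue %B %val, 1
> >>
> >> would become roughly:
> >>
> >> %tmp = alloca %B, align 8
> >> %dst = bitcast %B* %tmp to i8*
> >> %src = bitcast %B* %p to i8*
> >> call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 16, i32 8, i1 false)
> >> %addr = getelementptr inbounds %B, %B* %tmp, i32 0, i32 1
> >> %x = load i32, i32* %addr, align 8
> >>
> >> leaving SROA to turn the alloca back into scalars.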
> >>
> > I think this might make more sense; it sounds closer to the
> > transformation that Clang does for C-level aggregates, and should
> > play nicer with SROA, etc. I'm not entirely sure where it should
> > go, but feel free to prototype it as its own pass, and if it works
> > well, we'll worry about pipeline placement.
> >
> >> That wouldn't work for atomic/volatile, but I think that would
> >> work nicely for regular loads/stores.
> >>
> > We only support atomics on integer pointee types anyway, so atomics
> > are not relevant.
> OT: I plan to change this at some point.  In particular, I need
> atomic
> float and double access.  Doing the bitcasts in the frontend works,
> but
> given there's an obvious lowering, I don't see why we shouldn't do
> this
> within LLVM.
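>
> The workaround today looks roughly like this (a sketch; %p and %fp
> name the same memory):
>
> %iv = load atomic i32, i32* %p seq_cst, align 4
> %fv = bitcast i32 %iv to float
>
> where one would rather write directly (currently rejected by the
> verifier):
>
> %fv = load atomic float, float* %fp seq_cst, align 4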

I agree; I'd like them for float/double as well.

 -Hal

> >
> >   -Hal
> >
> >>
> >> How does that sound ?
> >>
> >>
> >>
> >>
> >> 2015-08-20 16:38 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> >>
> >>
> >> ----- Original Message -----
> >>> From: "deadal nix" < deadalnix at gmail.com >
> >>> To: "Hal Finkel" < hfinkel at anl.gov >
> >>> Cc: "Mehdi Amini" < mehdi.amini at apple.com >, "llvm-dev" <
> >>> llvm-dev at lists.llvm.org >
> >>> Sent: Thursday, August 20, 2015 4:09:17 PM
> >>> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> >>>
> >>
> >>> Problem:
> >>>
> >>> Many languages define aggregates and some way to manipulate them.
> >>> LLVM defines aggregate types (arrays and structs) to handle them.
> >>> However, when aggregates are loaded or stored, LLVM simply
> >>> ignores them up to legalization in the backend. This leads to
> >>> many missed optimizations. Most frontends use a set of tricks to
> >>> work around this limitation, but that is an undesirable
> >>> situation, as it increases the work required to write a frontend.
> >>> Ideally that work should be done once by LLVM instead of every
> >>> time by each frontend.
> >>>
> >>> In previous discussions on the subject, many LLVM users have
> >>> expressed interest in being able to use aggregate memory
> >>> accesses. In addition, it is likely that this would have reduced
> >>> the workload of some existing frontends.
> >>>
> >>> The proposed solution is to transform aggregate loads and stores
> >>> into something that the LLVM toolchain already understands and is
> >>> able to work with.
> >>>
> >>> The proposed solution uses InstCombine to do the transformation,
> >>> as it runs early and will allow subsequent passes to work with
> >>> something familiar (basically, canonicalization).
> >>>
> >>>
> >>> Proposed solution:
> >>>
> >>>
> >>> Aggregate loads and stores are turned into loads and stores of a
> >>> scalar of the same size and alignment. Binary manipulations, such
> >>> as masks and shifts, are used to build the aggregate from the
> >>> scalar after loading, and to turn the aggregate into a scalar
> >>> before storing.
> >>>
> >>>
> >>> For instance, the following IR (extracted from a D frontend):
> >>>
> >>> %B__vtbl = type { i8*, i32 (%B*)* }
> >>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> >>>
> >>> %0 = tail call i8* @allocmemory(i64 32)
> >>>
> >>> %1 = bitcast i8* %0 to %B*
> >>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> >>>
> >>>
> >>> Would be canonicalized into:
> >>> %0 = tail call i8* @allocmemory(i64 32)
> >>> %1 = bitcast i8* %0 to i128*
> >>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to
> >>> i64) to i128), i128 774763251095801167872), i128* %1, align 8
> >>>
> >>> (The constant i128 774763251095801167872 is just i32 42 shifted
> >>> left by 64 bits, i.e. placed at the second field's offset.)
> >>>
> >> As I've said before, we really should introduce some useful
> >> canonicalization for these, and so I support the investigations in
> >> this area.
> >>
> >> Have you verified that InstCombine can undo this transformation
> >> reasonably? Specifically, if you have a bunch of insertvalues, and
> >> an aggregate store in one function, and a load in another function
> >> with a bunch of extractvalues, which gets inlined into the
> >> function
> >> with the store, does everything get combined appropriately to
> >> forward the values directly regardless of how many of the members
> >> are extracted and in what order?
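> >>
> >> Something along these lines (a sketch; assumes %B = type
> >> { %B__vtbl*, i32 } as in your example):
> >>
> >> define void @f(%B* %p) {
> >>   %v1 = insertvalue %B undef, %B__vtbl* @B__vtblZ, 0
> >>   %v2 = insertvalue %B %v1, i32 42, 1
> >>   store %B %v2, %B* %p, align 8
> >>   ret void
> >> }
> >>
> >> define i32 @g(%B* %p) {
> >>   call void @f(%B* %p)
> >>   %v = load %B, %B* %p, align 8
> >>   %x = extractvalue %B %v, 1
> >>   ret i32 %x
> >> }
> >>
> >> After inlining @f into @g, %x should fold to i32 42 without any
> >> memory round-trip.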
> >>
> >> -Hal
> >>
> >>
> >>
> >>>
> >>> Which the rest of the LLVM pipeline can work with.
> >>>
> >>>
> >>>
> >>> Limitations:
> >>>
> >>>
> >>> 1/ This solution will not properly handle large (tens of
> >>> kilobytes) aggregates. It is an accepted limitation, both for
> >>> this proposal and for other parts of the pipeline that handle
> >>> aggregates. Optionally, checks can be added both to this
> >>> canonicalization and to SROA to disable them on very large
> >>> aggregates, so as to avoid wasting work that won't yield good
> >>> codegen in the end anyway. This limitation should not be a
> >>> blocker, as most aggregates are fairly small. For instance, some
> >>> languages make heavy use of fat pointers, and would greatly
> >>> benefit from this canonicalization.
> >>>
> >>>
> >>> 2/ This solution will generate loads and stores of values that
> >>> may not be natively supported by the hardware. The hardware does
> >>> not natively support aggregates to begin with, so both the
> >>> original IR and the canonicalized IR will require optimization.
> >>> This is not ideal, but the canonicalization is a plus for two
> >>> reasons:
> >>> - A subset of these memory accesses won't need any further
> >>> canonicalization (see the sketch after this list).
> >>> - Other passes in LLVM will be able to work with these loads and
> >>> perform adequate transformations.
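> >>>
> >>> For instance (a sketch, assuming a little-endian 64-bit target),
> >>> a load of a { i32, i32 } pair becomes a single native i64 load,
> >>> with the fields recovered by shift and truncate:
> >>>
> >>> %ip = bitcast { i32, i32 }* %p to i64*
> >>> %v = load i64, i64* %ip, align 4
> >>> %f0 = trunc i64 %v to i32
> >>> %hi = lshr i64 %v, 32
> >>> %f1 = trunc i64 %hi to i32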
> >>>
> >>>
> >>>
> >>> Possible alternatives:
> >>>
> >>>
> >>> In order to mitigate 1/, it is possible to gate the
> >>> canonicalization on aggregates under a certain size, essentially
> >>> avoiding work that will lead to bad codegen no matter what.
> >>>
> >>> In order to mitigate 2/, it is possible to slice aggregate loads
> >>> and stores according to the target's data layout. This CANNOT be
> >>> done for atomic/volatile accesses, as it would change their
> >>> semantics, but it can be done for regular ones, which are the
> >>> most common.
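> >>>
> >>> For the %B example above, slicing the store would give roughly
> >>> (a sketch):
> >>>
> >>> %p0 = getelementptr inbounds %B, %B* %1, i32 0, i32 0
> >>> store %B__vtbl* @B__vtblZ, %B__vtbl** %p0, align 8
> >>> %p1 = getelementptr inbounds %B, %B* %1, i32 0, i32 1
> >>> store i32 42, i32* %p1, align 4
> >>>
> >>> i.e. one natively-supported store per field instead of a single
> >>> i128 store.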
> >>>
> >>>
> >>>
> >>>
> >>> Does that look better as an RFC?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> 2015-08-19 22:11 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Mehdi Amini via llvm-dev" < llvm-dev at lists.llvm.org >
> >>>> To: "deadal nix" < deadalnix at gmail.com >
> >>>> Cc: "llvm-dev" < llvm-dev at lists.llvm.org >
> >>>> Sent: Wednesday, August 19, 2015 7:24:28 PM
> >>>> Subject: Re: [llvm-dev] [RFC] Aggregate load/store, proposed plan
> >>>>
> >>>> Hi,
> >>>>
> >>>> To be sure, because the RFC below is not detailed and assumes
> >>>> everyone knows about all the emails from 10 months ago,
> >>> I agree. The RFC needs to summarize the problems and the
> >>> potential
> >>> solutions.
> >>>
> >>>> is there more to do
> >>>> than what is proposed in http://reviews.llvm.org/D9766?
> >>>>
> >>>> So basically the proposal is that *InstCombine*
> >>> I think that fixing this early in the optimizer makes sense
> >>> (InstCombine, etc.). This seems little different from any other
> >>> canonicalization problem. These direct aggregate IR values are
> >>> valid
> >>> IR, but not our preferred canonical form, so we should transform
> >>> the
> >>> IR, when possible, into our preferred canonical form.
> >>>
> >>> -Hal
> >>>
> >>>
> >>>
> >>>> turns aggregate
> >>>> load/store into a load/store using an integer of equivalent size
> >>>> and inserts the correct bitcasts before/after, right?
> >>>>
> >>>> Example is:
> >>>>
> >>>> %0 = tail call i8* @allocmemory(i64 32)
> >>>> %1 = bitcast i8* %0 to %B*
> >>>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> >>>>
> >>>> into:
> >>>>
> >>>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to
> >>>> i64) to i128), i128 774763251095801167872), i128* %1, align 8
> >>>>
> >>>> Where the aggregate is:
> >>>>
> >>>> %B__vtbl = type { i8*, i32 (%B*)* }
> >>>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> —
> >>>> Mehdi
> >>>>
> >>>>
> >>>>> On Aug 19, 2015, at 5:02 PM, deadal nix via llvm-dev
> >>>>> < llvm-dev at lists.llvm.org > wrote:
> >>>>>
> >>>>> It is pretty clear people need this. Let's get this moving.
> >>>>>
> >>>>> I'll try to sum up the points that have been made, and I'll
> >>>>> try to address them carefully.
> >>>>>
> >>>>> 1/ There is no good solution for large aggregates.
> >>>>> That is true. However, I don't think this is a reason not to
> >>>>> address smaller aggregates, as they appear to be needed.
> >>>>> Realistically, the proportion of aggregates that are very large
> >>>>> is small, and there is no expectation that such a thing would
> >>>>> map nicely to the hardware anyway (the hardware won't have
> >>>>> enough registers to load it all). I do think it is reasonable
> >>>>> to expect good handling of relatively small aggregates like
> >>>>> fat pointers while accepting that large ones will be
> >>>>> inefficient.
> >>>>>
> >>>>> This limitation is not unique to the current discussion, as
> >>>>> SROA suffers from the same limitation. It is possible to
> >>>>> disable the transformation for aggregates that are too large
> >>>>> if this is too big a concern. The same should maybe also be
> >>>>> done for SROA.
> >>>>>
> >>>>> 2/ Slicing the aggregate breaks the semantics of
> >>>>> atomic/volatile.
> >>>>> That is true. It means slicing should not be done for
> >>>>> atomic/volatile accesses. It doesn't mean this should not be
> >>>>> done for regular ones, as it is reasonable to handle
> >>>>> atomic/volatile differently. After all, they have different
> >>>>> semantics.
> >>>>>
> >>>>> 3/ Not slicing can create scalars that aren't supported by the
> >>>>> target. This is undesirable.
> >>>>> Indeed. But as always, the important question is: compared to
> >>>>> what?
> >>>>>
> >>>>> The hardware has no notion of aggregates, so an aggregate and
> >>>>> a large scalar both end up requiring legalization. Doing the
> >>>>> transformation is still beneficial:
> >>>>> - Some aggregates will produce valid scalars. For such
> >>>>> aggregates, this is a 100% win.
> >>>>> - For aggregates that won't, the situation is still better, as
> >>>>> various optimization passes will be able to handle the load in
> >>>>> a sensible manner.
> >>>>> - The transformation never makes the situation worse than it
> >>>>> is to begin with.
> >>>>>
> >>>>> In previous discussions, Hal Finkel seemed to think that the
> >>>>> scalar solution is preferable to the slicing one.
> >>>>>
> >>>>> Is that a fair assessment of the situation? Considering all of
> >>>>> this, I think the right path forward is:
> >>>>> - Go for the scalar solution in the general case.
> >>>>> - If that is a problem, the slicing approach can be used for
> >>>>> non-atomic/volatile accesses.
> >>>>> - If necessary, disable the transformation for very large
> >>>>> aggregates (and consider doing so for SROA as well).
> >>>>>
> >>>>> Do we have a plan?
> >>>>>
> >>>>> _______________________________________________
> >>>>> LLVM Developers mailing list
> >>>>> llvm-dev at lists.llvm.org
> >>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>> _______________________________________________
> >>>> LLVM Developers mailing list
> >>>> llvm-dev at lists.llvm.org
> >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>>
> >>> --
> >>> Hal Finkel
> >>> Assistant Computational Scientist
> >>> Leadership Computing Facility
> >>> Argonne National Laboratory
> >>>
> >>>
> >> --
> >> Hal Finkel
> >> Assistant Computational Scientist
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >>
> >>
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

