[llvm-dev] [RFC] Aggreate load/store, proposed plan

deadal nix via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 21 15:05:13 PDT 2015


Awesome, I will work on a prototype.

2015-08-21 14:46 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:

> ----- Original Message -----
> > From: "deadal nix" <deadalnix at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Mehdi Amini" <mehdi.amini at apple.com>, "llvm-dev" <
> llvm-dev at lists.llvm.org>
> > Sent: Friday, August 21, 2015 1:24:04 AM
> > Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
> >
> > I've done some investigation. The proposed approach seems to work
> > well on structs with no padding, but the optimizer gets confused when
> > the struct has some. That's a bummer, but it already improves
> > things, so I'd rather have that than nothing.
> >
> > Alternatively, I have an idea, in theory only so far, that may be
> > better. The idea is to create an alloca, memcpy the aggregate into it,
> > transform extractvalues into loads, and turn each insertvalue into an
> > optional memcpy into a new alloca (if the old value is still alive
> > after the insertvalue) plus a store. There is some extra trickery one
> > needs to do for call arguments, ret and phi nodes, but overall, I
> > think this can work. This approach would rely on SROA to generate
> > something sensible. As it is going to do massive substitutions, I'm
> > not sure InstCombine would be the right place to do it. Maybe its
> > own pass?
> >
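> > For illustration, a rough sketch of what I have in mind (the %pair
> > type and all value names here are hypothetical, not from an actual
> > patch):

```llvm
; Before: a direct aggregate load feeding an extractvalue.
%pair = type { i64, i32 }

define i32 @before(%pair* %p) {
  %agg = load %pair, %pair* %p
  %f1 = extractvalue %pair %agg, 1
  ret i32 %f1
}

; After: copy the aggregate into a fresh alloca via memcpy, then turn
; each extractvalue into a GEP + scalar load. SROA is then expected to
; clean up the alloca and the memcpy.
declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

define i32 @after(%pair* %p) {
  %tmp = alloca %pair
  %dst = bitcast %pair* %tmp to i8*
  %src = bitcast %pair* %p to i8*
  ; 16 bytes: i64 + i32 + 4 bytes of tail padding
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 16, i32 8, i1 false)
  %addr = getelementptr inbounds %pair, %pair* %tmp, i32 0, i32 1
  %f1 = load i32, i32* %addr
  ret i32 %f1
}
```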
>
> I think this might make more sense; it sounds closer to the transformation
> that Clang does for C-level aggregates, and should play nicer with SROA,
> etc. I'm not entirely sure where it should go, but feel free to prototype
> it as its own pass, and if it works well, we'll worry about pipeline
> placement.
>
> >
> > That wouldn't work for atomic/volatile, but I think that would work
> > nicely for regular loads/stores.
> >
>
> We only support atomics on integer pointee types anyway, so atomics are
> not relevant.
>
>  -Hal
>
> >
> >
> > How does that sound ?
> >
> >
> >
> >
> > 2015-08-20 16:38 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> >
> >
> > ----- Original Message -----
> > > From: "deadal nix" < deadalnix at gmail.com >
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > > Cc: "Mehdi Amini" < mehdi.amini at apple.com >, "llvm-dev" <
> > > llvm-dev at lists.llvm.org >
> > > Sent: Thursday, August 20, 2015 4:09:17 PM
> > > Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
> > >
> >
> >
> > > Problem :
> > >
> > > Many languages define aggregates and some way to manipulate them.
> > > LLVM defines aggregate types (arrays and structs) to handle them.
> > > However, when aggregates are loaded or stored, LLVM will simply
> > > ignore these up to legalization in the backend. This leads to
> > > many misoptimizations. Most frontends use a set of tricks to
> > > work around this limitation, but that is an undesirable situation,
> > > as it increases the work required to write a frontend. Ideally that
> > > work should be done once by LLVM instead of every time by each frontend.
> > >
> > > In previous discussions on the subject, many LLVM users have
> > > expressed interest in being able to use aggregate memory accesses.
> > > In addition, it is likely that this would have reduced the workload
> > > of some existing frontends.
> > >
> > > The proposed solution is to transform aggregate loads and stores
> > > into something that the LLVM toolchain already understands and is
> > > able to work with.
> > >
> > > The proposed solution will use InstCombine to do the
> > > transformation as it is done early and will allow subsequent passes
> > > to work with something familiar (basically, canonicalization).
> > >
> > >
> > > Proposed solution :
> > >
> > >
> > > Aggregate loads and stores are turned into loads and stores of a
> > > scalar of the same size and alignment. Binary manipulation, like
> > > masks and shifts, is used to build the aggregate from the scalar
> > > after loading, and to turn the aggregate into a scalar when
> > > storing.
> > >
> > >
> > > For instance, the following IR (extracted from a D frontend) :
> > >
> > > %B__vtbl = type { i8*, i32 (%B*)* }
> > > @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> > >
> > > %0 = tail call i8* @allocmemory(i64 32)
> > >
> > > %1 = bitcast i8* %0 to %B*
> > > store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> > >
> > >
> > > Would be canonicalized into:
> > > %0 = tail call i8* @allocmemory(i64 32)
> > > %1 = bitcast i8* %0 to i128*
> > > store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to i64)
> > > to i128), i128 774763251095801167872), i128* %1, align 8
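> > > The load side would be the mirror image: a scalar load plus shifts
> > > and truncations to rebuild each field. A hedged sketch (value names
> > > are illustrative; note that 42 << 64 is exactly
> > > 774763251095801167872, the constant in the store above):

```llvm
; Hypothetical canonicalization of:
;   %v = load %B, %B* %p
; into a scalar i128 load plus mask/shift to recover the fields.
%1 = bitcast %B* %p to i128*
%2 = load i128, i128* %1, align 8
%3 = trunc i128 %2 to i64
%4 = inttoptr i64 %3 to %B__vtbl*   ; field 0: the vtable pointer
%5 = lshr i128 %2, 64
%6 = trunc i128 %5 to i32           ; field 1: the i32 42
```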
> > >
> >
> > As I've said before, we really should introduce some useful
> > canonicalization for these, and so I support the investigations in
> > this area.
> >
> > Have you verified that InstCombine can undo this transformation
> > reasonably? Specifically, if you have a bunch of insertvalues, and
> > an aggregate store in one function, and a load in another function
> > with a bunch of extractvalues, which gets inlined into the function
> > with the store, does everything get combined appropriately to
> > forward the values directly regardless of how many of the members
> > are extracted and in what order?
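> > (To make the test concrete, the scenario I have in mind looks roughly
> > like the following illustrative IR, with @write getting inlined into
> > @read; the %pair type and names are hypothetical:)

```llvm
%pair = type { i64, i32 }

; One function builds the aggregate with insertvalues and stores it.
define void @write(%pair* %p, i64 %a, i32 %b) {
  %v0 = insertvalue %pair undef, i64 %a, 0
  %v1 = insertvalue %pair %v0, i32 %b, 1
  store %pair %v1, %pair* %p
  ret void
}

; Another loads it back and extracts a member. After inlining @write,
; the combiner should forward %b directly to the ret.
define i32 @read(%pair* %p, i64 %a, i32 %b) {
  call void @write(%pair* %p, i64 %a, i32 %b)
  %v = load %pair, %pair* %p
  %f = extractvalue %pair %v, 1
  ret i32 %f
}
```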
> >
> > -Hal
> >
> >
> >
> > >
> > >
> > > Which the rest of the LLVM pipeline can work with.
> > >
> > >
> > >
> > > Limitations :
> > >
> > >
> > > 1/ This solution will not properly handle large (tens of kilobytes)
> > > aggregates. It is an accepted limitation, both for this proposal
> > > and for other parts of the pipeline that handle aggregates.
> > > Optionally, checks can be added both for this canonicalization and
> > > for SROA to disable them on very large aggregates, so as to avoid
> > > wasting work that won't yield good codegen in the end anyway. This
> > > limitation should not be a blocker, as most aggregates are fairly
> > > small. For instance, some languages make heavy use of fat pointers,
> > > and would greatly benefit from this canonicalization.
> > >
> > >
> > > 2/ This solution will generate loads and stores of values that may
> > > not be natively supported by the hardware. The hardware does not
> > > natively support aggregates to begin with, so both the original IR
> > > and the canonicalized IR will require optimization. This is not
> > > ideal, but the canonicalization is a plus for 2 reasons:
> > > - A subset of these memory accesses won't need canonicalization
> > > anymore.
> > >
> > > - Other passes in LLVM will be able to work with these loads and
> > > perform adequate transformations.
> > >
> > >
> > >
> > > Possible alternatives :
> > >
> > >
> > > In order to mitigate 1/, it is possible to gate the
> > > canonicalization to aggregates under a certain size. This
> > > essentially avoids doing work that will lead to bad codegen no
> > > matter what.
> > >
> > > In order to mitigate 2/, it is possible to slice aggregate loads
> > > and stores according to the target's data layout. This CANNOT be
> > > implemented for atomic/volatile as it would change the semantics,
> > > but it can be done for regular ones, which are the most common.
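> > > Applied to the earlier %B example, slicing would emit one
> > > natively sized access per field instead of a single i128 store (a
> > > sketch, assuming %B is laid out as { %B__vtbl*, i32 } with the i32
> > > at offset 8):

```llvm
; Hypothetical sliced form of:
;   store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
%f0 = getelementptr inbounds %B, %B* %1, i32 0, i32 0
store %B__vtbl* @B__vtblZ, %B__vtbl** %f0, align 8
%f1 = getelementptr inbounds %B, %B* %1, i32 0, i32 1
store i32 42, i32* %f1, align 8
```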
> > >
> > >
> > >
> > >
> > > Does that look better as an RFC?
> > >
> > >
> > >
> > >
> > >
> > > 2015-08-19 22:11 GMT-07:00 Hal Finkel < hfinkel at anl.gov > :
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Mehdi Amini via llvm-dev" < llvm-dev at lists.llvm.org >
> > > > To: "deadal nix" < deadalnix at gmail.com >
> > > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >
> > > > Sent: Wednesday, August 19, 2015 7:24:28 PM
> > > > Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan
> > > >
> > > > Hi,
> > > >
> > > > To be sure, because the RFC below is not detailed and assumes
> > > > everyone knows about all the emails from 10 months ago,
> > >
> > > I agree. The RFC needs to summarize the problems and the potential
> > > solutions.
> > >
> > > > is there more to do
> > > > than what is proposed in http://reviews.llvm.org/D9766 ?
> > > >
> > > > So basically the proposal is that *InstCombine*
> > >
> > > I think that fixing this early in the optimizer makes sense
> > > (InstCombine, etc.). This seems little different from any other
> > > canonicalization problem. These direct aggregate IR values are
> > > valid
> > > IR, but not our preferred canonical form, so we should transform
> > > the
> > > IR, when possible, into our preferred canonical form.
> > >
> > > -Hal
> > >
> > >
> > >
> > > > turns an aggregate
> > > > load/store into a load/store using an integer of equivalent size,
> > > > and inserts the correct bitcast before/after, right?
> > > >
> > > > Example is:
> > > >
> > > > %0 = tail call i8* @allocmemory(i64 32)
> > > > %1 = bitcast i8* %0 to %B*
> > > > store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8
> > > >
> > > > into:
> > > >
> > > > store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to
> > > > i64)
> > > > to i128), i128 774763251095801167872), i128* %1, align 8
> > > >
> > > > Where the aggregate is:
> > > >
> > > > %B__vtbl = type { i8*, i32 (%B*)* }
> > > > @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > —
> > > > Mehdi
> > > >
> > > >
> > > > > On Aug 19, 2015, at 5:02 PM, deadal nix via llvm-dev
> > > > > < llvm-dev at lists.llvm.org > wrote:
> > > > >
> > > > > It is pretty clear people need this. Let's get this moving.
> > > > >
> > > > > I'll try to sum up the points that have been made, and I'll
> > > > > try to address them carefully.
> > > > >
> > > > > 1/ There is no good solution for large aggregates.
> > > > > That is true. However, I don't think this is a reason not to
> > > > > address smaller aggregates, as they appear to be needed.
> > > > > Realistically, the proportion of aggregates that are very large
> > > > > is small, and there is no expectation that such a thing would
> > > > > map nicely to the hardware anyway (the hardware won't have
> > > > > enough registers to load it all). I do think it is reasonable
> > > > > to expect good handling of relatively small aggregates like
> > > > > fat pointers, while accepting that large ones will be
> > > > > inefficient.
> > > > >
> > > > > This limitation is not unique to the current discussion, as
> > > > > SROA suffers from the same limitation.
> > > > > It is possible to disable the transformation for aggregates
> > > > > that are too large if this is too big of a concern. It should
> > > > > maybe also be done for SROA.
> > > > >
> > > > > 2/ Slicing the aggregate breaks the semantics of
> > > > > atomic/volatile.
> > > > > That is true. It means slicing the aggregate should not be
> > > > > done for atomic/volatile. It doesn't mean this should not be
> > > > > done for regular ones, as it is reasonable to handle
> > > > > atomic/volatile differently. After all, they have different
> > > > > semantics.
> > > > >
> > > > > 3/ Not slicing can create scalars that aren't supported by the
> > > > > target. This is undesirable.
> > > > > Indeed. But as always, the important question is: compared to
> > > > > what?
> > > > >
> > > > > The hardware has no notion of aggregates, so an aggregate and
> > > > > a large scalar both end up requiring legalization. Doing the
> > > > > transformation is still beneficial:
> > > > > - Some aggregates will generate valid scalars. For such
> > > > > aggregates, this is a 100% win.
> > > > > - For aggregates that won't, the situation is still better, as
> > > > > various optimization passes will be able to handle the load in
> > > > > a sensible manner.
> > > > > - The transformation never makes the situation worse than it
> > > > > is to begin with.
> > > > >
> > > > > In previous discussions, Hal Finkel seemed to think that the
> > > > > scalar solution is preferable to the slicing one.
> > > > >
> > > > > Is that a fair assessment of the situation? Considering all of
> > > > > this, I think the right path forward is:
> > > > > - Go for the scalar solution in the general case.
> > > > > - If that is a problem, the slicing approach can be used for
> > > > > non atomic/volatile.
> > > > > - If necessary, disable the transformation for very large
> > > > > aggregates (and consider doing so for SROA as well).
> > > > >
> > > > > Do we have a plan?
> > > > >
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > llvm-dev at lists.llvm.org
> > > > >
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > >
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > llvm-dev at lists.llvm.org
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>

