<div dir="ltr">I played a bit with the alloca based approach. Here is a simple version of it:<br><div><br><a href="http://reviews.llvm.org/D12269">http://reviews.llvm.org/D12269</a><br><br></div><div>Overall, it seems to play nicely with the rest of LLVM.<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-08-21 15:27 GMT-07:00 Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>

> From: "Philip Reames" <<a href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>><br>

> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>>, "deadal nix" <<a href="mailto:deadalnix@gmail.com">deadalnix@gmail.com</a>><br>

> Cc: "llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

</span><span class="">> Sent: Friday, August 21, 2015 5:20:34 PM<br>

> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan<br>

><br>

><br>

><br>

</span><div><div class="h5">> On 08/21/2015 02:46 PM, Hal Finkel via llvm-dev wrote:<br>

> > ----- Original Message -----<br>

> >> From: "deadal nix" <<a href="mailto:deadalnix@gmail.com">deadalnix@gmail.com</a>><br>

> >> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

> >> Cc: "Mehdi Amini" <<a href="mailto:mehdi.amini@apple.com">mehdi.amini@apple.com</a>>, "llvm-dev"<br>

> >> <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

> >> Sent: Friday, August 21, 2015 1:24:04 AM<br>

> >> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan<br>

> >><br>

> >> I've done some investigation. The proposed approach seems to work<br>

> >> well on struct with no padding, but the optimizer gets confused<br>

> >> when<br>

> >> the struct has some. That's a bummer but that is already improving<br>

> >> things, so I'd rather have that than nothing.<br>

> >><br>

> >> Alternatively, I think I have, in theory only so far, that may be<br>

> >> better. The idea is to create an alloca, memcpy the aggregate in<br>

> >> it,<br>

> >> and transform extract values into loads, and insertvalues into an<br>

> >> optional memecy into a new alloca (if the old value is still alive<br>

> >> after the insertvalue) + a store. There is some extra trickery one<br>

> >> need to do for call arguments, ret and phi nodes, but overall, I<br>

> >> think this can work. This approach would rely on SROA to generate<br>

> >> something sensible. As it is going to do massive substitutions,<br>

> >> I'm<br>

> >> not sure InstCombine would be the right place to do it. Maybe its<br>

> >> own pass ?<br>

> >><br>

> > I think this might make more sense; it sounds closer to the<br>

> > transformation that Clang does for C-level aggregates, and should<br>

> > play nicer with SROA, etc. I'm not entirely sure where it should<br>

> > go, but feel free to prototype it as its own pass, and if it works<br>

> > well, we'll worry about pipeline placement.<br>

> ><br>

> >> That wouldn't work for atomic/volatile, but I think that would<br>

> >> work<br>

> >> nicely for regular load/stores.<br>

> >><br>

> > We only support atomics on integer pointee types anyway, so atomics<br>

> > are not relevant.<br>

> OT: I plan to change this at some point.  In particular, I need<br>

> atomic<br>

> float and double access.  Doing the bitcasts in the frontend works,<br>

> but<br>

> given there's an obvious lowering, I don't see why we shouldn't do<br>

> this<br>

> within LLVM.<br>

<br>

</div></div>I agree; I'd like them for float/double as well.<br>

<span class="HOEnZb"><font color="#888888"><br>

 -Hal<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

> ><br>

> >   -Hal<br>

> ><br>

> >><br>

> >> How does that sound ?<br>

> >><br>

> >><br>

> >><br>

> >><br>

> >> 2015-08-20 16:38 GMT-07:00 Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> > :<br>

> >><br>

> >><br>

> >> ----- Original Message -----<br>

> >>> From: "deadal nix" < <a href="mailto:deadalnix@gmail.com">deadalnix@gmail.com</a> ><br>

> >>> To: "Hal Finkel" < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> >>> Cc: "Mehdi Amini" < <a href="mailto:mehdi.amini@apple.com">mehdi.amini@apple.com</a> >, "llvm-dev" <<br>

> >>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>

> >>> Sent: Thursday, August 20, 2015 4:09:17 PM<br>

> >>> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan<br>

> >>><br>

> >><br>

> >>> Problem :<br>

> >>><br>

> >>> Many languages define aggregates and some way to manipulate them.<br>

> >>> LLVM define aggregates types (arrays and structs) to handle them.<br>

> >>> However, when aggregate are loaded or stored, LLVM will simply<br>

> >>> ignore these up to the legalization in the backend. This lead to<br>

> >>> many misoptimizations. Most frontend are using a set of trick to<br>

> >>> work around this limtation, but that an undesirable situation as<br>

> >>> it<br>

> >>> increase the work required to write a front end. Ideally that<br>

> >>> work<br>

> >>> should be done once by LLVM instead of every time by each<br>

> >>> frontend.<br>

> >>><br>

> >>> In previous discussion on the subject, many LLVM user have<br>

> >>> expressed<br>

> >>> interest in being able to use aggregate memory access. In<br>

> >>> addition,<br>

> >>> it is likely that it would have reduced the workload of some<br>

> >>> existing frontends.<br>

> >>><br>

> >>> The proposed solution is to transform aggregate loads and stores<br>

> >>> to<br>

> >>> something that the LLVM toolcahin already understand and is able<br>

> >>> to<br>

> >>> work with.<br>

> >>><br>

> >>> The proposed solution will use InstCombine to do the<br>

> >>> transformation as it is done early and will allow subsequent<br>

> >>> passes<br>

> >>> to work with something familiar (basically, canonicalization).<br>

> >>><br>

> >>><br>

> >>> Proposed solution :<br>

> >>><br>

> >>><br>

> >>> Aggregate load and store are turned into aggregate load and store<br>

> >>> of<br>

> >>> a scalar of the same size and alignement. Binary manipulation,<br>

> >>> like<br>

> >>> mask and shift, are used to build the aggregate from the scalar<br>

> >>> after loading and the aggregate to a scalar when storing.<br>

> >>><br>

> >>><br>

> >>> For instance, the following IR (extracted from a D frontend) :<br>

> >>><br>

> >>> %B__vtbl = type { i8*, i32 (%B*)* }<br>

> >>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }<br>

> >>><br>

> >>> %0 = tail call i8* @allocmemory(i64 32)<br>

> >>><br>

> >>> %1 = bitcast i8* %0 to %B*<br>

> >>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8<br>

> >>><br>

> >>><br>

> >>> Would be canonized into :<br>

> >>> %0 = tail call i8* @allocmemory(i64 32)<br>

> >>> %1 = bitcast i8* %0 to i128*<br>

> >>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to<br>

> >>> i64)<br>

> >>> to i128), i128 774763251095801167872), i128* %1, align 8<br>

> >>><br>

> >> As I've said before, we really should introduce some useful<br>

> >> canonicalization for these, and so I support the investigations in<br>

> >> this area.<br>

> >><br>

> >> Have you verified that InstCombine that undo this transformation<br>

> >> reasonably? Specifically, if you have a bunch of insertvalues, and<br>

> >> an aggregate store in one function, and a load in another function<br>

> >> with a bunch of extractvalues, which gets inlined into the<br>

> >> function<br>

> >> with the store, does everything get combined appropriately to<br>

> >> forward the values directly regardless of how many of the members<br>

> >> are extracted and in what order?<br>

> >><br>

> >> -Hal<br>

> >><br>

> >><br>

> >><br>

> >>><br>

> >>> Which the rest of the LLVM pipeline can work with.<br>

> >>><br>

> >>><br>

> >>><br>

> >>> Limitations :<br>

> >>><br>

> >>><br>

> >>> 1/ This solution will not handle properly large (tens of<br>

> >>> kilobytes)<br>

> >>> aggregates. It is an accepted limitation, both for this proposal<br>

> >>> and<br>

> >>> other part of the pipeline that handle aggregates. Optionally,<br>

> >>> checks can be added both for this canonicalization and SROA to<br>

> >>> disable them on very large aggregates as to avoid wasting work<br>

> >>> that<br>

> >>> won't yield good codegen at the end anyway. This limitation<br>

> >>> should<br>

> >>> not be a blocker as most aggregate are fairly small. For<br>

> >>> instance,<br>

> >>> some language make heavy use of fat pointers, and would greatly<br>

> >>> benefit from this canonicalization.<br>

> >>><br>

> >>><br>

> >>> 2/ This solution will generate loads and stores of value that may<br>

> >>> not<br>

> >>> be natively supported by the hardware. The hardware do not<br>

> >>> natively<br>

> >>> support aggregate to begin with, so both original IR and<br>

> >>> canonized<br>

> >>> IR will require optimization. This is not ideal, but the<br>

> >>> canonicalization is a plus for 2 reasons:<br>

> >>> - A subset of these memory access won't need canonicalization<br>

> >>> anymore.<br>

> >>><br>

> >>> - Other passes in LLVM will be able to work with these load and<br>

> >>> perform adequate transformations.<br>

> >>><br>

> >>><br>

> >>><br>

> >>> Possible alternatives :<br>

> >>><br>

> >>><br>

> >>> In order to mitigate 1/ it is possible to gate the<br>

> >>> canonicalization<br>

> >>> to aggregate under a certain size. This essentially avoiding to<br>

> >>> do<br>

> >>> work that will lead to bad codegen no matter what.<br>

> >>><br>

> >>> In order to mitigate 2/, it is possible to slice aggregates loads<br>

> >>> and<br>

> >>> stores according to the target's data layout. This CANNOT be<br>

> >>> implemented for atomic/volatile as it would change semantic, but<br>

> >>> can<br>

> >>> be done for regulars ones, which are the most commons.<br>

> >>><br>

> >>><br>

> >>><br>

> >>><br>

> >>> Do that looks better as an RFC ?<br>

> >>><br>

> >>><br>

> >>><br>

> >>><br>

> >>><br>

> >>> 2015-08-19 22:11 GMT-07:00 Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> > :<br>

> >>><br>

> >>><br>

> >>> ----- Original Message -----<br>

> >>>> From: "Mehdi Amini via llvm-dev" < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>

> >>>> To: "deadal nix" < <a href="mailto:deadalnix@gmail.com">deadalnix@gmail.com</a> ><br>

> >>>> Cc: "llvm-dev" < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> ><br>

> >>>> Sent: Wednesday, August 19, 2015 7:24:28 PM<br>

> >>>> Subject: Re: [llvm-dev] [RFC] Aggreate load/store, proposed plan<br>

> >>>><br>

> >>>> Hi,<br>

> >>>><br>

> >>>> To be sure, because the RFC below is not detailed and assume<br>

> >>>> everyone<br>

> >>>> knows about all the emails from 10 months ago,<br>

> >>> I agree. The RFC needs to summarize the problems and the<br>

> >>> potential<br>

> >>> solutions.<br>

> >>><br>

> >>>> is there more to do<br>

> >>>> than what is proposed in <a href="http://reviews.llvm.org/D9766" rel="noreferrer" target="_blank">http://reviews.llvm.org/D9766</a> ?<br>

> >>>><br>

> >>>> So basically the proposal is that *InstCombine*<br>

> >>> I think that fixing this early in the optimizer makes sense<br>

> >>> (InstCombine, etc.). This seems little different from any other<br>

> >>> canonicalization problem. These direct aggregate IR values are<br>

> >>> valid<br>

> >>> IR, but not our preferred canonical form, so we should transform<br>

> >>> the<br>

> >>> IR, when possible, into our preferred canonical form.<br>

> >>><br>

> >>> -Hal<br>

> >>><br>

> >>><br>

> >>><br>

> >>>> turns aggregate<br>

> >>>> load/store into a load/store using an integer of equivalent size<br>

> >>>> and<br>

> >>>> insert the correct bitcast before/after, right?<br>

> >>>><br>

> >>>> Example is:<br>

> >>>><br>

> >>>> %0 = tail call i8* @allocmemory(i64 32)<br>

> >>>> %1 = bitcast i8* %0 to %B*<br>

> >>>> store %B { %B__vtbl* @B__vtblZ, i32 42 }, %B* %1, align 8<br>

> >>>><br>

> >>>> into:<br>

> >>>><br>

> >>>> store i128 or (i128 zext (i64 ptrtoint (%B__vtbl* @B__vtblZ to<br>

> >>>> i64)<br>

> >>>> to i128), i128 774763251095801167872), i128* %1, align 8<br>

> >>>><br>

> >>>> Where the aggregate is:<br>

> >>>><br>

> >>>> %B__vtbl = type { i8*, i32 (%B*)* }<br>

> >>>> @B__vtblZ = constant %B__vtbl { i8* null, i32 (%B*)* @B.foo }<br>

> >>>><br>

> >>>><br>

> >>>> Thanks,<br>

> >>>><br>

> >>>> —<br>

> >>>> Mehdi<br>

> >>>><br>

> >>>><br>

> >>>>> On Aug 19, 2015, at 5:02 PM, deadal nix via llvm-dev<br>

> >>>>> < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> > wrote:<br>

> >>>>><br>

> >>>>> It is pretty clear people need this. Let's get this moving.<br>

> >>>>><br>

> >>>>> I'll try to sum up the point that have been made and I'll try<br>

> >>>>> to<br>

> >>>>> address them carefully.<br>

> >>>>><br>

> >>>>> 1/ There is no good solution for large aggregates.<br>

> >>>>> That is true. However, I don't think this is a reason to not<br>

> >>>>> address smaller aggregates, as they appear to be needed.<br>

> >>>>> Realistically, the proportion of aggregates that are very large<br>

> >>>>> is<br>

> >>>>> small, and there is no expectation that such a thing would map<br>

> >>>>> nicely to the hardware anyway (the hardware won't have enough<br>

> >>>>> registers to load it all anyway). I do think this is reasonable<br>

> >>>>> to<br>

> >>>>> expect a reasonable handling of relatively small aggregates<br>

> >>>>> like<br>

> >>>>> fat pointers while accepting that larges ones will be<br>

> >>>>> inefficient.<br>

> >>>>><br>

> >>>>> This limitation is not unique to the current discussion, as<br>

> >>>>> SROA<br>

> >>>>> suffer from the same limitation.<br>

> >>>>> It is possible to disable to transformation for aggregates that<br>

> >>>>> are<br>

> >>>>> too large if this is too big of a concern. It should maybe also<br>

> >>>>> be<br>

> >>>>> done for SROA.<br>

> >>>>><br>

> >>>>> 2/ Slicing the aggregate break the semantic of atomic/volatile.<br>

> >>>>> That is true. It means slicing the aggregate should not be done<br>

> >>>>> for<br>

> >>>>> atomic/volatile. It doesn't mean this should not be done for<br>

> >>>>> regular ones as it is reasonable to handle atomic/volatile<br>

> >>>>> differently. After all, they have different semantic.<br>

> >>>>><br>

> >>>>> 3/ Not slicing can create scalar that aren't supported by the<br>

> >>>>> target. This is undesirable.<br>

> >>>>> Indeed. But as always, the important question is compared to<br>

> >>>>> what<br>

> >>>>> ?<br>

> >>>>><br>

> >>>>> The hardware has no notion of aggregate, so an aggregate or a<br>

> >>>>> large<br>

> >>>>> scalar ends up both requiring legalization. Doing the<br>

> >>>>> transformation is still beneficial :<br>

> >>>>> - Some aggregates will generate valid scalars. For such<br>

> >>>>> aggregate,<br>

> >>>>> this is 100% win.<br>

> >>>>> - For aggregate that won't, the situation is still better as<br>

> >>>>> various optimization passes will be able to handle the load in<br>

> >>>>> a<br>

> >>>>> sensible manner.<br>

> >>>>> - The transformation never make the situation worse than it is<br>

> >>>>> to<br>

> >>>>> begin with.<br>

> >>>>><br>

> >>>>> On previous discussion, Hal Finkel seemed to think that the<br>

> >>>>> scalar<br>

> >>>>> solution is preferable to the slicing one.<br>

> >>>>><br>

> >>>>> Is that a fair assessment of the situation ? Considering all of<br>

> >>>>> this, I think the right path forward is :<br>

> >>>>> - Go for the scalar solution in the general case.<br>

> >>>>> - If that is a problem, the slicing approach can be used for<br>

> >>>>> non<br>

> >>>>> atomic/volatile.<br>

> >>>>> - If necessary, disable the transformation for very large<br>

> >>>>> aggregates (and consider doing so for SROA as well).<br>

> >>>>><br>

> >>>>> Do we have a plan ?<br>

> >>>>><br>

> >>>>> _______________________________________________<br>

> >>>>> LLVM Developers mailing list<br>

> >>>>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> >>>>> <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=BQIGaQ&c=eEvniauFctOgLOKGJOplqw&r=v-ruWq0KCv2O3thJZiK6naxuXK8mQHZUmGq5FBtAmZ4&m=KkqzAZMcLUlWa3Uwmbr4DQqJdYQAzN_pFY3M8dzVdZ8&s=SFb1jraizjgechN0Pq3738tzBZyK8dZRqIU8Zfi_Qns&e=" rel="noreferrer" target="_blank">https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=BQIGaQ&c=eEvniauFctOgLOKGJOplqw&r=v-ruWq0KCv2O3thJZiK6naxuXK8mQHZUmGq5FBtAmZ4&m=KkqzAZMcLUlWa3Uwmbr4DQqJdYQAzN_pFY3M8dzVdZ8&s=SFb1jraizjgechN0Pq3738tzBZyK8dZRqIU8Zfi_Qns&e=</a><br>

> >>>> _______________________________________________<br>

> >>>> LLVM Developers mailing list<br>

> >>>> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> >>>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

> >>>><br>

> >>> --<br>

> >>> Hal Finkel<br>

> >>> Assistant Computational Scientist<br>

> >>> Leadership Computing Facility<br>

> >>> Argonne National Laboratory<br>

> >>><br>

> >>><br>

> >> --<br>

> >> Hal Finkel<br>

> >> Assistant Computational Scientist<br>

> >> Leadership Computing Facility<br>

> >> Argonne National Laboratory<br>

> >><br>

> >><br>

><br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</div></div></blockquote></div><br></div>