[llvm-dev] Aggregate load/stores

Philip Reames via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 19 10:06:29 PDT 2015


This thread is deep enough, and the start of it confrontational enough, 
that I doubt many people are still reading this far.  Please rephrase this 
as a separate RFC to ensure visibility.

For the record, the overall direction you're sketching seems entirely 
reasonable to me.

Philip

On 08/18/2015 10:31 PM, deadal nix via llvm-dev wrote:
> It is pretty clear people need this. Let's get this moving.
>
> I'll try to sum up the points that have been made and address them 
> carefully.
>
> 1/ There is no good solution for large aggregates.
> That is true. However, I don't think this is a reason not to address 
> smaller aggregates, as they appear to be needed. Realistically, the 
> proportion of aggregates that are very large is small, and there is no 
> expectation that such a thing would map nicely to the hardware anyway 
> (the hardware won't have enough registers to hold it all). I do 
> think it is reasonable to expect good handling of relatively 
> small aggregates like fat pointers while accepting that large ones 
> will be inefficient.
>
> This limitation is not unique to the current discussion, as SROA 
> suffers from the same limitation.
> It is possible to disable the transformation for aggregates that are 
> too large if this is too big of a concern. It should maybe also be 
> done for SROA.
>
> 2/ Slicing the aggregate breaks the semantics of atomic/volatile.
> That is true. It means slicing the aggregate should not be done for 
> atomic/volatile. It doesn't mean this should not be done for regular 
> ones, as it is reasonable to handle atomic/volatile differently. After 
> all, they have different semantics.
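>
> For illustration, slicing a small aggregate load could look roughly
> like this (a sketch only, using a hypothetical { i8*, i64 } fat
> pointer and 2015-era typed-pointer IR; none of this is taken from the
> actual patch):
>
>   %struct.fat = type { i8*, i64 }
>
>   ; original form:
>   ;   %v = load %struct.fat, %struct.fat* %p
>   ; sliced into per-field loads:
>   %p0 = getelementptr inbounds %struct.fat, %struct.fat* %p, i32 0, i32 0
>   %f0 = load i8*, i8** %p0
>   %p1 = getelementptr inbounds %struct.fat, %struct.fat* %p, i32 0, i32 1
>   %f1 = load i64, i64* %p1
>   %v0 = insertvalue %struct.fat undef, i8* %f0, 0
>   %v  = insertvalue %struct.fat %v0, i64 %f1, 1
>
> An atomic or volatile load of %p cannot be split into two accesses
> without changing its semantics, hence the exception above.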
>
> 3/ Not slicing can create scalars that aren't supported by the target. 
> This is undesirable.
> Indeed. But as always, the important question is: compared to what?
> 
> The hardware has no notion of aggregates, so both an aggregate and a 
> large scalar end up requiring legalization. Doing the transformation 
> is still beneficial (a sketch of the scalar form follows below):
>  - Some aggregates will produce valid scalars. For such aggregates, 
> this is a 100% win.
>  - For aggregates that won't, the situation is still better, as various 
> optimization passes will be able to handle the load in a sensible manner.
>  - The transformation never makes the situation worse than it was to 
> begin with.
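>
> To make the scalar form concrete, here is a rough sketch on the same
> kind of { i8*, i64 } fat pointer (illustrative only; it assumes a
> little-endian target with 64-bit pointers and no padding, and is not
> taken from the actual patch):
>
>   %struct.fat = type { i8*, i64 }
>
>   ; the aggregate load becomes one wide integer load, and the fields
>   ; are carved back out of it:
>   %raw = bitcast %struct.fat* %p to i128*
>   %int = load i128, i128* %raw
>   %lo  = trunc i128 %int to i64
>   %f0  = inttoptr i64 %lo to i8*
>   %shr = lshr i128 %int, 64
>   %f1  = trunc i128 %shr to i64
>   %v0  = insertvalue %struct.fat undef, i8* %f0, 0
>   %v   = insertvalue %struct.fat %v0, i64 %f1, 1
>
> When i128 is legal (or cheap to legalize), this is a clear win; when
> it is not, the backend legalizes it much as it would any other wide
> scalar.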
>
> In a previous discussion, Hal Finkel seemed to think that the scalar 
> solution is preferable to the slicing one.
>
> Is that a fair assessment of the situation? Considering all of this, 
> I think the right path forward is:
>  - Go for the scalar solution in the general case.
>  - If that is a problem, the slicing approach can be used for 
> non-atomic/volatile loads and stores.
>  - If necessary, disable the transformation for very large aggregates 
> (and consider doing so for SROA as well).
>
> Do we have a plan?
>
>
> 2015-08-18 18:36 GMT-07:00 Nicholas Chapman via llvm-dev 
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>:
>
>     Oh,
>     and another potential reason for handling aggregate loads and
>     stores directly is that it expresses the semantics of the program
>     more clearly, which I think should allow LLVM to optimise more
>     aggressively.
>     Here's a bug report showing a missed optimisation, which I think
>     is due to the use of memcpy, which in turn is required to work
>     around slow structure loads and stores:
>     https://llvm.org/bugs/show_bug.cgi?id=23226
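>
>     For illustration, the two forms in question look roughly like this
>     (a sketch only, using a hypothetical { i32, i32, i32 } struct and
>     the 2015-era memcpy intrinsic signature; not the IR from the bug
>     report):
>
>       %struct.S = type { i32, i32, i32 }
>       declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)
>
>       ; direct aggregate copy: expresses the semantics, but has
>       ; historically been lowered poorly
>       %v = load %struct.S, %struct.S* %src
>       store %struct.S %v, %struct.S* %dst
>
>       ; memcpy workaround: lowers well, but hides the copied value
>       ; from the optimizer
>       %d = bitcast %struct.S* %dst to i8*
>       %s = bitcast %struct.S* %src to i8*
>       call void @llvm.memcpy.p0i8.p0i8.i64(i8* %d, i8* %s, i64 12, i32 4, i1 false)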
>
>     Cheers,
>       Nick
>     On 17/08/2015 22:02, mats petersson via llvm-dev wrote:
>>     I've definitely "run into this problem", and I would very much
>>     love to remove my kludges [which are incomplete, because I keep
>>     finding places where I need to modify the code-gen to "fix" the
>>     same problem - this is probably par for the course for a
>>     complete amateur compiler writer who has only spent
>>     the last 14 months working (as a hobby) with LLVM].
>>
>>     So whilst I can't contribute much on the "what is the right
>>     solution" and "how do we solve this" fronts, I would very much like to
>>     see something that allows the user of LLVM to use load/store
>>     without things like "is my thing that I'm storing big? If so,
>>     don't generate a load, use a memcpy instead". Not only does this
>>     make the usage of LLVM harder, it also causes slow compilation
>>     [perhaps this is a separate problem, but I have a simple program
>>     that copies a large struct a few times, and if I turn off my "use
>>     memcpy for large things" kludge, the compile time gets quite a lot
>>     longer - approx 1000x, and 48 seconds is a long time to compile
>>     37 lines of relatively straightforward code - even the Pascal
>>     compiler on the PDP-11/70 that I used at my school in the 1980s was
>>     capable of doing more than 1 line per second, and it didn't run
>>     anywhere near 2.5GHz and had 20-30 users at any time I could use it...]
>>
>>     ../lacsap -no-memcpy -tt longcompile.pas
>>     Time for Parse 0.657 ms
>>     Time for Analyse 0.018 ms
>>     Time for Compile 1.248 ms
>>     Time for CreateObject 48803.263 ms
>>     Time for CreateBinary 48847.631 ms
>>     Time for Compile 48854.064 ms
>>
>>     compared with:
>>     ../lacsap -tt longcompile.pas
>>     Time for Parse 0.455 ms
>>     Time for Analyse 0.013 ms
>>     Time for Compile 1.138 ms
>>     Time for CreateObject 44.627 ms
>>     Time for CreateBinary 82.758 ms
>>     Time for Compile 95.797 ms
>>
>>     wc longcompile.pas
>>      37  84 410 longcompile.pas
>>
>>     Source here:
>>     https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas
>>
>>
>>     --
>>     Mats
>>
>>     On 17 August 2015 at 21:18, deadal nix via llvm-dev
>>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>
>>         OK, what about this plan:
>>
>>         Slice the aggregate into a series of valid loads/stores for
>>         non-atomic ones.
>>         Use a big scalar for atomic/volatile ones.
>>         Try to generate memcpy or memmove when possible?
>>
>>
>>         2015-08-17 12:16 GMT-07:00 deadal nix <deadalnix at gmail.com
>>         <mailto:deadalnix at gmail.com>>:
>>
>>
>>
>>             2015-08-17 11:26 GMT-07:00 Mehdi Amini
>>             <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>>:
>>
>>                 Hi,
>>
>>>                 On Aug 17, 2015, at 12:13 AM, deadal nix via
>>>                 llvm-dev <llvm-dev at lists.llvm.org
>>>                 <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>
>>>
>>>
>>>                 2015-08-16 23:21 GMT-07:00 David Majnemer
>>>                 <david.majnemer at gmail.com
>>>                 <mailto:david.majnemer at gmail.com>>:
>>>
>>>
>>>
>>>                     Because a solution which doesn't generalize is
>>>                     not a very powerful solution. What happens when
>>>                     somebody says that they want to use atomics +
>>>                     large aggregate loads and stores? Give them yet
>>>                     another, different answer? That would mean our
>>>                     earlier, less general approach was
>>>                     either a band-aid (bad) or the new answer
>>>                     requires a parallel code path in their frontend
>>>                     (worse).
>>>
>>
>>
>>                 +1 with David's approach: making things incrementally
>>                 better is fine *as long as* the long-term direction
>>                 is identified. Small incremental changes that make
>>                 things slightly better in the short term but drive
>>                 us away from the long-term direction are not good.
>>
>>                 Don't get me wrong, I'm not saying that the current
>>                 patch is not good, just that it does not seem clear
>>                 to me that the long-term direction has been
>>                 identified, which explains why some can be nervous
>>                 about adding stuff prematurely.
>>                 And I'm not for the status quo: while I can't judge
>>                 it definitively myself, I even bugged David last
>>                 month to look at this revision and try to identify
>>                 what the long-term direction really is and how to
>>                 make your (and other) frontends' lives easier.
>>
>>
>>
>>             As long as there is something to be done. Concern has
>>             been raised for very large aggregates (64KB, 1MB), but there
>>             is no way good codegen can come out of these anyway. I
>>             don't know of any machine that has 1MB of registers
>>             available to hold the load. Even if we had a good way to
>>             handle it in InstCombine, the backend would have no
>>             capability to generate something nice for it anyway. Most
>>             aggregates are small, and there is no good excuse not to
>>             handle them just because someone could generate
>>             gigantic ones that won't map nicely to the hardware anyway.
>>
>>             By that logic, SROA should not exist, as one could
>>             generate gigantic aggregates as well (in fact, SROA fails
>>             pretty badly on large aggregates).
>>
>>             The second concern raised is for atomic/volatile, which
>>             need to be handled differently by the optimizer anyway,
>>             so this is mostly irrelevant here.
>>
>>>
>>>
>>>                 clang has many developers behind it, some of them
>>>                 paid to work on it. That's simply not the case for
>>>                 many others.
>>>
>>>                 But to answer your questions:
>>>                  - Per-field loads/stores generate more loads/stores
>>>                 than necessary in many cases. These can't be
>>>                 aggregated back because of padding (see the sketch
>>>                 below).
>>>                  - memcpy only works memory to memory. It is
>>>                 certainly usable in some cases, but it certainly does
>>>                 not cover all uses.
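>>>
>>>                 As a sketch of the padding problem (a hypothetical
>>>                 { i8, i32 } struct, not taken from the patch):
>>>
>>>                   %struct.S = type { i8, i32 }  ; 3 padding bytes after the i8
>>>
>>>                   ; per-field stores, as a frontend would emit them:
>>>                   %p0 = getelementptr inbounds %struct.S, %struct.S* %p, i32 0, i32 0
>>>                   store i8 %a, i8* %p0
>>>                   %p1 = getelementptr inbounds %struct.S, %struct.S* %p, i32 0, i32 1
>>>                   store i32 %b, i32* %p1
>>>
>>>                 These cannot be merged back into a single i64 store,
>>>                 because that would also overwrite the three padding
>>>                 bytes, which the per-field form leaves untouched.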
>>>
>>>                 I'm willing to do the memcpy optimization in
>>>                 InstCombine (in fact, if things had not degenerated
>>>                 into so much bikeshedding, that would already be done).
>>
>>                 Calling "bikeshedding" what other devs think is
>>                 what keeps the quality of the project high is
>>                 unlikely to help your patch go through; it's probably
>>                 quite the opposite, actually.
>>
>>
>>
>>             I understand the desire to keep quality high. That's
>>             not where the problem is. The problem lies in
>>             discussing actual proposals against hypothetical perfect
>>             ones that do not exist.
>>
>>
>>
>
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
