[llvm-dev] Aggregate load/stores

Tue Aug 18 18:36:03 PDT 2015

Oh, and another potential reason for handling aggregate loads and stores 
directly is that it expresses the semantics of the program more clearly, 
which I think should allow LLVM to optimise more aggressively.
Here's a bug report showing a missed optimisation, which I think is due 
to the use of memcpy, which in turn is required to work around slow 
structure loads and stores:
https://llvm.org/bugs/show_bug.cgi?id=23226
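
As a rough illustration (a minimal sketch, not taken from the bug report), 
the C++ below shows the same struct copy written two ways: as a 
whole-aggregate assignment, which a frontend could express as one aggregate 
load plus one aggregate store, and as an explicit memcpy, which is the 
workaround mentioned above. The Point3 type and both function names are 
hypothetical and exist only for this example.

```cpp
#include <cstring>

// Hypothetical aggregate used only for illustration.
struct Point3 {
    double x, y, z;
};

// Copy expressed as a whole-aggregate assignment. A frontend could lower
// this to a single aggregate load followed by a single aggregate store,
// keeping the field types visible.
void copy_by_value(Point3 *dst, const Point3 *src) {
    *dst = *src;
}

// The same copy expressed as a byte copy: the result is identical, but
// the operation is described only as "copy sizeof(Point3) bytes".
void copy_by_memcpy(Point3 *dst, const Point3 *src) {
    std::memcpy(dst, src, sizeof(Point3));
}
```

Both functions leave the destination with the same contents; they differ 
only in how much of the program's intent is stated in the copy itself.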

Cheers,
   Nick

On 17/08/2015 22:02, mats petersson via llvm-dev wrote:
> I've definitely "run into this problem", and I would very much love to 
> remove my kludges [that are incomplete, because I keep finding places 
> where I need to modify the code-gen to "fix" the same problem - this 
> is probably par for the course from a complete amateur compiler writer 
> and someone that has only spent the last 14 months working (as a 
> hobby) with LLVM].
>
> So whilst I can't contribute much on the "what is the right solution" 
> and "how do we solve this", I would very much like to see something 
> that allows the user of LLVM to use load/store without things like "is 
> my thing that I'm storing big, if so don't generate a load, use a 
> memcpy instead". Not only does this make the usage of LLVM harder, it 
> also causes slow compilation [perhaps this is a separate problem, but I 
> have a simple program that copies a large struct a few times, and if I 
> turn off my "use memcpy for large things", the compile time gets quite 
> a lot longer - approx 1000x, and 48 seconds is a long time to compile 
> 37 lines of relatively straightforward code - even the Pascal 
> compiler on the PDP-11/70 that I used at my school in the 1980s was 
> capable of doing more than 1 line per second, and it didn't run anywhere 
> near 2.5GHz and had 20-30 users anytime I could use it...]
>
> ../lacsap -no-memcpy -tt longcompile.pas
> Time for Parse 0.657 ms
> Time for Analyse 0.018 ms
> Time for Compile 1.248 ms
> Time for CreateObject 48803.263 ms
> Time for CreateBinary 48847.631 ms
> Time for Compile 48854.064 ms
>
> compared with:
> ../lacsap -tt longcompile.pas
> Time for Parse 0.455 ms
> Time for Analyse 0.013 ms
> Time for Compile 1.138 ms
> Time for CreateObject 44.627 ms
> Time for CreateBinary 82.758 ms
> Time for Compile 95.797 ms
>
> wc longcompile.pas
>  37  84 410 longcompile.pas
>
> Source here:
> https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas
>
>
> --
> Mats
>
> On 17 August 2015 at 21:18, deadal nix via llvm-dev 
> <llvm-dev at lists.llvm.org> wrote:
>
>     OK, what about this plan:
>
>     Slice the aggregate into a series of valid loads/stores for
>     non-atomic ones.
>     Use a big scalar for atomic/volatile ones.
>     Try to generate memcpy or memmove when possible?
>
>
>     2015-08-17 12:16 GMT-07:00 deadal nix <deadalnix at gmail.com>:
>
>
>
>         2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com>:
>
>             Hi,
>
>>             On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev
>>             <llvm-dev at lists.llvm.org> wrote:
>>
>>
>>
>>             2015-08-16 23:21 GMT-07:00 David Majnemer
>>             <david.majnemer at gmail.com>:
>>
>>
>>
>>                 Because a solution which doesn't generalize is not a
>>                 very powerful solution. What happens when somebody
>>                 says that they want to use atomics + large aggregate
>>                 loads and stores? Give them yet another, different
>>                 answer? That would mean our earlier, less general
>>                 approach was either a band-aid (bad) or the
>>                 new answer requires a parallel code path in their
>>                 frontend (worse).
>>
>
>
>             +1 with David’s approach: making things incrementally
>             better is fine *as long as* the long term direction is
>             identified. Small incremental changes that make things
>             slightly better in the short term but drive us away from
>             the long term direction are not good.
>
>             Don’t get me wrong, I’m not saying that the current patch
>             is not good, just that it does not seem clear to me that
>             the long term direction has been identified, which explains
>             why some can be nervous about adding stuff prematurely.
>             And I’m not for the status quo: while I can’t judge it
>             definitively myself, I even bugged David last month to
>             look at this revision and try to identify what is really
>             the long term direction and how to make your (and other)
>             frontends’ life easier.
>
>
>
>         As long as there is something to be done. Concern has been
>         raised for very large aggregates (64K, 1MB) but there is no way
>         good codegen can come out of these anyway. I don't know of
>         any machine that has 1MB of registers available to absorb the
>         load. Even if we had a good way to handle it in InstCombine,
>         the backend would have no capability to generate something
>         nice for it anyway. Most aggregates are small and there is no
>         good excuse to not do anything to handle them because someone
>         could generate gigantic ones that won't map nicely to the
>         hardware anyway.
>
>         By that logic, SROA should not exist, as one could generate
>         gigantic aggregates as well (in fact, SROA fails pretty badly on
>         large aggregates).
>
>         The second concern raised is for atomic/volatile, which needs
>         to be handled by the optimizer differently anyway, so is
>         mostly irrelevant here.
>
>>
>>
>>             clang has many developers behind it, some of them paid to
>>             work on it. That's simply not the case for many others.
>>
>>             But to answer your questions:
>>              - Per-field loads/stores generate more loads/stores than
>>             necessary in many cases. These can't be aggregated back
>>             because of padding.
>>              - memcpy only works memory to memory. It is certainly
>>             usable in some cases, but certainly does not cover all uses.
>>
>>             I'm willing to do the memcpy optimization in InstCombine
>>             (in fact, if things had not degenerated into so much
>>             bikeshedding, that would already be done).
>
>             Calling “bikeshedding” what other devs think is what
>             keeps the quality of the project high is unlikely to help
>             your patch go through; it’s probably quite the opposite,
>             actually.
>
>
>
>         I understand the desire to keep quality high. That is not
>         where the problem is. The problem lies in discussing actual
>         proposals against hypothetical perfect ones that do not exist.
>
>
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org
>     http://llvm.cs.uiuc.edu
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev