[llvm-dev] Aggregate load/stores

mats petersson via llvm-dev llvm-dev at lists.llvm.org
Mon Aug 17 14:52:31 PDT 2015


That's kind of my point - it turns the load/store into lots of
instructions, and the suggested solution I got when I pointed that out
was "well, you should use memcpy for large data structures, there is an
intrinsic for it". That led to this little function:
https://github.com/Leporacanthicus/lacsap/blob/master/expr.cpp#L305
along with a few other bits and pieces that do a similar "if it's big
enough, call memcpy" check.
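
For illustration, a minimal sketch of that pattern, assuming the LLVM
3.7-era C++ IRBuilder API (the helper name and the 32-byte threshold are
made up for the example, not taken from lacsap):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/IRBuilder.h"

// Copy an aggregate of type Ty from Src to Dest, picking between a plain
// load/store pair and the llvm.memcpy intrinsic based on the type's size.
static void emitAggregateCopy(llvm::IRBuilder<> &Builder,
                              const llvm::DataLayout &DL,
                              llvm::Value *Dest, llvm::Value *Src,
                              llvm::Type *Ty) {
  uint64_t Size = DL.getTypeAllocSize(Ty);
  const uint64_t MemcpyThreshold = 32; // bytes; picked for the example
  if (Size >= MemcpyThreshold) {
    // One intrinsic call; the backend lowers it to a tight copy loop or
    // a libc call instead of thousands of scalar moves.
    Builder.CreateMemCpy(Dest, Src, Size, /*Align=*/8);
  } else {
    // Small aggregates lower fine as a plain load/store pair.
    llvm::Value *V = Builder.CreateLoad(Src);
    Builder.CreateStore(V, Dest);
  }
}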

I'm not sure if the results are better on any other processor architecture
- since my home setup consists only of x86-64 machines, I haven't
experimented with anything else.

--
Mats

On 17 August 2015 at 22:43, Mehdi Amini <mehdi.amini at apple.com> wrote:

> The instruction selection for X86 turns:
>
> define void @P.p1(%1* byval) {
> entry:
>   %y = alloca %1, align 8
>   %1 = load %1, %1* %0
>   store %1 %1, %1* %y
>   %valueindex2 = bitcast %1* %y to [8000 x i32]*
>   %valueindex1 = getelementptr [8000 x i32], [8000 x i32]* %valueindex2,
> i32 0, i32 1
>   %2 = load i32, i32* %valueindex1
>   call void @__write_int(%0* @output, i32 %2, i32 1)
>   call void @__write_nl(%0* @output)
>   ret void
> }
>
> into 16014 instructions; that sounds pretty terrible :(
>
> -- Mehdi
>
>
> On Aug 17, 2015, at 2:35 PM, mats petersson <mats at planetcatfish.com>
> wrote:
>
> Even if I drop to -O0 [in other words, no optimisation passes at all], it
> takes the same amount of time.
>
> The time is spent in
>
>   12.94%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator==
>    7.68%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator*
>    7.53%  lacsap  lacsap  [.] llvm::SelectionDAG::ReplaceAllUsesOfValueWith
>    7.28%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator++
>    5.59%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::operator!=
>    4.65%  lacsap  lacsap  [.] llvm::SDNode::hasNUsesOfValue
>    3.82%  lacsap  lacsap  [.] llvm::SDUse::getResNo
>    2.33%  lacsap  lacsap  [.] llvm::SDValue::getResNo
>    2.19%  lacsap  lacsap  [.] llvm::SDUse::getNext
>    1.32%  lacsap  lacsap  [.] llvm::SDNode::use_iterator::getUse
>    1.28%  lacsap  lacsap  [.] llvm::SDUse::getUser
>
> Here's the LLVM IR generated:
> https://gist.github.com/Leporacanthicus/9b662f88e0c4a471e51a
>
> And as can be seen here, -O0 produces "no passes":
> https://github.com/Leporacanthicus/lacsap/blob/master/lacsap.cpp#L76
>
> ../lacsap -no-memcpy -tt longcompile.pas  -O0
> Time for Parse 0.502 ms
> Time for Analyse 0.015 ms
> Time for Compile 1.038 ms
> Time for CreateObject 48134.541 ms
> Time for CreateBinary 48179.720 ms
> Time for Compile 48187.351 ms
>
> And before someone says "but you are running a debug build": if I use a
> release build of the compiler, it does speed things up quite nicely, about
> 3x, but the compile still takes 17 seconds vs 45ms.
>
>
> ../lacsap -no-memcpy -tt longcompile.pas  -O0
> Time for Parse 0.937 ms
> Time for Analyse 0.005 ms
> Time for Compile 0.559 ms
> Time for CreateObject 17241.177 ms
> Time for CreateBinary 17286.701 ms
> Time for Compile 17289.187 ms
>
> ../lacsap -tt longcompile.pas
> Time for Parse 0.274 ms
> Time for Analyse 0.004 ms
> Time for Compile 0.258 ms
> Time for CreateObject 7.504 ms
> Time for CreateBinary 45.405 ms
> Time for Compile 46.670 ms
>
> I believe I know what happens: the compiler is trying to figure out the
> best order of instructions, and looks at N^2 instructions that are pretty
> much independently executable with no code or data dependencies. So it
> iterates over a vast number of possible permutations, only to find that
> they are all pretty much equally good/bad. But like I said earlier,
> although I'm a professional software engineer, compilers are just a
> hobby project for me, and I only started a little over a year ago, so I
> make no pretense of knowing the answer. Using memcpy instead avoids the
> problem, as it presents the backend with a single call rather than
> thousands of independent loads and stores to order.
>
> --
> Mats
>
> On 17 August 2015 at 22:05, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> Hi Mats,
>>
>> The performance problem seems like a potentially separate issue.
>> Can you send the input IR in both cases and the list of passes you are
>> running?
>>
>> Thanks,
>>
>> -- Mehdi
>>
>>
>> On Aug 17, 2015, at 2:02 PM, mats petersson <mats at planetcatfish.com>
>> wrote:
>>
>> I've definitely "run into this problem", and I would very much love to
>> remove my kludges [which are incomplete, because I keep finding places
>> where I need to modify the code-gen to "fix" the same problem - this is
>> probably par for the course for a complete amateur compiler writer who
>> has only spent the last 14 months working (as a hobby) with LLVM].
>>
>> So whilst I can't contribute much on "what is the right solution" and
>> "how do we solve this", I would very much like to see something that
>> allows the user of LLVM to use load/store without having to ask "is the
>> thing I'm storing big? If so, don't generate a load, use a memcpy
>> instead". Not only does the current situation make LLVM harder to use,
>> it also causes slow compilation [perhaps this is a separate problem, but
>> I have a simple program that copies a large struct a few times, and if I
>> turn off my "use memcpy for large things" workaround, the compile time
>> gets quite a lot longer - approx 1000x, and 48 seconds is a long time to
>> compile 37 lines of relatively straightforward code. Even the Pascal
>> compiler on the PDP-11/70 that I used at school in the 1980s managed
>> more than one line per second, and it ran nowhere near 2.5GHz and had
>> 20-30 users on it whenever I could use it...]
>>
>> ../lacsap -no-memcpy -tt longcompile.pas
>> Time for Parse 0.657 ms
>> Time for Analyse 0.018 ms
>> Time for Compile 1.248 ms
>> Time for CreateObject 48803.263 ms
>> Time for CreateBinary 48847.631 ms
>> Time for Compile 48854.064 ms
>>
>> compared with:
>> ../lacsap -tt longcompile.pas
>> Time for Parse 0.455 ms
>> Time for Analyse 0.013 ms
>> Time for Compile 1.138 ms
>> Time for CreateObject 44.627 ms
>> Time for CreateBinary 82.758 ms
>> Time for Compile 95.797 ms
>>
>> wc longcompile.pas
>>  37  84 410 longcompile.pas
>>
>> Source here:
>> https://github.com/Leporacanthicus/lacsap/blob/master/test/longcompile.pas
>>
>>
>> --
>> Mats
>>
>> On 17 August 2015 at 21:18, deadal nix via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> OK, what about this plan:
>>>
>>> Slice the aggregate into a series of valid loads/stores for non-atomic
>>> ones (sketched below).
>>> Use a single big scalar for atomic/volatile ones.
>>> Try to generate memcpy or memmove when possible?
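>>>
>>> (A rough sketch of the slicing step for the non-atomic case, in
>>> 3.7-era C++; names are illustrative and nested aggregates are left
>>> out:)
>>>
>>> // Slice a non-atomic store of a struct value into one scalar store
>>> // per field; the padding between fields is simply never written.
>>> static void sliceAggregateStore(llvm::IRBuilder<> &B, llvm::Value *Agg,
>>>                                 llvm::Value *Ptr,
>>>                                 llvm::StructType *STy) {
>>>   for (unsigned i = 0, e = STy->getNumElements(); i != e; ++i) {
>>>     llvm::Value *Field = B.CreateExtractValue(Agg, i);
>>>     llvm::Value *FieldPtr = B.CreateStructGEP(STy, Ptr, i);
>>>     // A real pass would recurse when the field is itself an
>>>     // aggregate, and use a single wide integer for atomic/volatile.
>>>     B.CreateStore(Field, FieldPtr);
>>>   }
>>> }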
>>>
>>>
>>> 2015-08-17 12:16 GMT-07:00 deadal nix <deadalnix at gmail.com>:
>>>
>>>>
>>>>
>>>> 2015-08-17 11:26 GMT-07:00 Mehdi Amini <mehdi.amini at apple.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> On Aug 17, 2015, at 12:13 AM, deadal nix via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> 2015-08-16 23:21 GMT-07:00 David Majnemer <david.majnemer at gmail.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> Because a solution which doesn't generalize is not a very powerful
>>>>>> solution. What happens when somebody says that they want to use
>>>>>> atomics + large aggregate loads and stores? Give them yet another,
>>>>>> different answer? That would mean our earlier, less general approach
>>>>>> was either a bandaid (bad) or that the new answer requires a parallel
>>>>>> code path in their frontend (worse).
>>>>>>
>>>>>
>>>>>
>>>>> +1 on David's approach: making things incrementally better is fine
>>>>> *as long as* the long-term direction is identified. Small incremental
>>>>> changes that make things slightly better in the short term but drive
>>>>> us away from the long-term direction are not good.
>>>>>
>>>>> Don't get me wrong, I'm not saying that the current patch is not
>>>>> good, just that it does not seem clear to me that the long-term
>>>>> direction has been identified, which explains why some are nervous
>>>>> about adding stuff prematurely.
>>>>> And I'm not for the status quo: while I can't judge it definitively
>>>>> myself, I even bugged David last month to look at this revision and
>>>>> try to identify what the long-term direction really is and how to
>>>>> make your (and other) frontends' lives easier.
>>>>>
>>>>>
>>>>>
>>>> As long as there is something to be done. Concern has been raised
>>>> about very large aggregates (64K, 1MB), but there is no way good
>>>> codegen can come out of those anyway: I don't know of any machine
>>>> that has 1MB of registers available to hold the load. Even if we had
>>>> a good way to handle it in InstCombine, the backend would have no
>>>> capability to generate something nice for it. Most aggregates are
>>>> small, and there is no good excuse not to handle them just because
>>>> someone could generate gigantic ones that won't map nicely to the
>>>> hardware anyway.
>>>>
>>>> By that logic, SROA should not exist either, as one could generate
>>>> gigantic aggregates as well (in fact, SROA fails pretty badly on
>>>> large aggregates).
>>>>
>>>> The second concern raised is about atomic/volatile, which needs to be
>>>> handled differently by the optimizer anyway, so it is mostly
>>>> irrelevant here.
>>>>
>>>>
>>>>>
>>>>> clang has many developers behind it, some of them paid to work on
>>>>> it. That's simply not the case for many other frontends.
>>>>>
>>>>> But to answer your questions:
>>>>>  - Per-field load/store generates more loads/stores than necessary
>>>>> in many cases, and these can't be merged back together because of
>>>>> padding (illustrated below).
>>>>>  - memcpy only works memory to memory. It is certainly usable in
>>>>> some cases, but it certainly does not cover all uses.
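>>>>>
>>>>> (To illustrate the padding point with a made-up struct, in C++:)
>>>>>
>>>>> // Per-field copying of S takes two stores (an i8 and an i32) that
>>>>> // cannot legally be merged into one 8-byte store, because the three
>>>>> // padding bytes between the fields are undefined.
>>>>> struct S {
>>>>>   char tag;   // offset 0
>>>>>               // offsets 1-3: padding
>>>>>   int value;  // offset 4
>>>>> };            // sizeof(S) == 8 on x86-64; only 5 bytes carry data
>>>>> // memcpy(&dst, &src, sizeof(S)) simply copies the padding along and
>>>>> // stays a single wide copy, which is why it wins here.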
>>>>>
>>>>> I'm willing to do the memcpy optimization in InstCombine (in fact,
>>>>> had things not degenerated into so much bikeshedding, it would
>>>>> already be done).
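>>>>>
>>>>> (A minimal sketch of what such an InstCombine rewrite could look
>>>>> like, against the 3.7-era C++ API; the helper name and details are
>>>>> hypothetical, not the actual patch under review:)
>>>>>
>>>>> // Turn "%v = load %T, %T* %src ; store %T %v, %T* %dst" into a
>>>>> // single llvm.memcpy when %T is a plain (non-atomic, non-volatile)
>>>>> // aggregate.
>>>>> static bool foldAggregateCopy(llvm::StoreInst &SI,
>>>>>                               const llvm::DataLayout &DL) {
>>>>>   auto *LI = llvm::dyn_cast<llvm::LoadInst>(SI.getValueOperand());
>>>>>   if (!LI || !LI->hasOneUse() || !LI->getType()->isAggregateType())
>>>>>     return false;
>>>>>   if (LI->isVolatile() || SI.isVolatile() ||
>>>>>       LI->isAtomic() || SI.isAtomic())
>>>>>     return false; // leave atomic/volatile to separate handling
>>>>>   llvm::IRBuilder<> B(&SI);
>>>>>   // Assumes the locations don't partially overlap; a real patch
>>>>>   // would have to prove that, or emit memmove instead.
>>>>>   B.CreateMemCpy(SI.getPointerOperand(), LI->getPointerOperand(),
>>>>>                  DL.getTypeStoreSize(LI->getType()),
>>>>>                  SI.getAlignment());
>>>>>   SI.eraseFromParent();
>>>>>   LI->eraseFromParent();
>>>>>   return true;
>>>>> }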
>>>>>
>>>>>
>>>>> Calling "bikeshedding" what other devs think keeps the quality of
>>>>> the project high is unlikely to help your patch go through; it's
>>>>> probably quite the opposite, actually.
>>>>>
>>>>>
>>>>>
>>>> I understand the desire to keep quality high. That is not where the
>>>> problem is. The problem lies in weighing an actual proposal against
>>>> hypothetical perfect ones that do not exist.
>>>>
>>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>
>>
>
>