[Patch][ARM] Fix and enable the Load/Store optimisation pass for Thumb1

Wed May 14 04:55:30 PDT 2014

Hi James,

Thanks a lot - I've now also made a Phabricator diff for the main patch (0003) at http://reviews.llvm.org/D3757 for further code review. Any feedback is of course welcome!

Cheers
Moritz
________________________________________
From: James Molloy
Sent: 14 May 2014 12:50
To: Moritz Roth; llvm-commits at cs.uiuc.edu
Cc: 'Renato Golin'; t.p.northover at gmail.com
Subject: RE: [Patch][ARM] Fix and enable the Load/Store optimisation pass for Thumb1

Hi Moritz,

Thanks for working on this! Lack of LDM/STM on v6m is pretty bad for
performance across the board, not just on Dhrystone.

For the list's benefit, as this is Moritz's first LLVM patch I reviewed this
internally to the point of being happy with it. I personally think it's fine
to be committed, although on a second look I will no doubt find nits that I
missed first time round.

Would anyone (Renato? Tim?) like to review this further or is my LGTM
enough?

Cheers,

James

> -----Original Message-----
> From: Moritz Roth
> Sent: 14 May 2014 11:33
> To: llvm-commits at cs.uiuc.edu
> Cc: James Molloy
> Subject: [Patch][ARM] Fix and enable the Load/Store optimisation pass for
> Thumb1
>
> Hi all,
>
> this is a set of patches to add support for Thumb1 targets in the
> Load/Store optimisation pass, and re-enable that pass as well as inline
> memcpy expansion.
> Below is a short description of each patch:
>
> 0001 - This patch fixes a few comment typos and other style issues I
> addressed while working on this. It's fairly small, and there is no
> intended functionality change.
>
> 0002 - This patch re-enables the Load/Store optimisation pass for
> Thumb1-only targets. Since the actual change to the algorithm isn't in
> this patch yet, the pass simply returns and does nothing if invoked for
> such a target. Essentially, the place where the pass is disabled for
> Thumb1 is just moved down into the actual pass, so patch 0003 can easily
> make it *actually* do something. Again, there is no intended
> functionality change.
>
> 0003 - This is the main patch - it adds support in the Load/Store
> optimisation pass to correctly generate Thumb1 LDMIA/STMIA instructions
> and fully enables the pass.
> The reason this was disabled before is that the current algorithm always
> generates non-writeback Load/Store multiples first, and then tries to
> merge any applicable base register updates into the LDM/STM. Thumb1 only
> has LDM/STM with base register writeback, so this approach doesn't
> really work there. In a nutshell, my patch directly generates the Thumb1
> tLDMIA[_UPD] and tSTMIA_UPD instructions. It then scans over the current
> block and tries to update any future instructions that read the base
> register with the new offset added from the writeback. If this isn't
> possible, the base register is reset right before the next instruction
> that uses it. The later (base-writeback merge) stages of the pass aren't
> applicable to Thumb1, so they're not executed.
>
> This is a rather large patch and there are many details I've left out
> here. I'll put a more detailed description of the changes on Phabricator
> for review shortly.
> There is no intended functionality change for non-Thumb1 targets. I've
> added some tests to check that the pass is working - but note that there
> is another set of test cases for this (and memcpy expansion) in patch
> 0004. There's also a fix for a failing test where two instructions were
> being merged by the algorithm.
>
> 0004 - This patch re-enables inline memcpy expansion for Thumb1. It was
> disabled for Thumb1 since the Load/Store optimisation pass was disabled.
> There are also test cases to make sure that small memcpys are inlined,
> and that the resulting chains of LDR/STR are merged correctly into
> LDM/STM (see patch 0003). This patch should only be applied once 0003 is
> committed.
>
> Finally, regarding code size / performance impact: This patch has an
> impact on certain benchmarks that do lots of memcpy. By itself, it seems
> to give a ~7% improvement in Dhrystone. Together with some trickery to
> make clang align global strings at word boundaries (this allows a
> further memcpy to be inlined), there's a ~25% overall speed-up.
>
> Cheers
> Moritz
>
> PS: Sorry for the disclaimer, still working on getting that removed from
> my work email account.
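
To make the approach described for patch 0003 above a bit more concrete, here is a toy model in Python. It is only a sketch and not the code from the patch: the instruction classes (Ldr, Ldmia, SubImm), the merge_loads function and the rule of always resetting the base at the end of the block are assumptions made for illustration. It handles only a leading run of word loads and ignores the many constraints the real pass has to respect (stores, register classes, flag clobbers, liveness).

```python
from dataclasses import dataclass

WORD = 4  # bytes moved per register


@dataclass(frozen=True)
class Ldr:        # dst = mem[base + offset]
    dst: str
    base: str
    offset: int


@dataclass(frozen=True)
class Ldmia:      # load regs from [base], then base += 4 * len(regs)
    base: str
    regs: tuple


@dataclass(frozen=True)
class SubImm:     # reg -= imm (hypothetical stand-in for the base reset)
    reg: str
    imm: int


def regnum(name):
    return int(name[1:])  # "r3" -> 3


def merge_loads(block):
    # Collect a leading run of loads from one base at offsets 0, 4, 8, ...
    # into ascending destination registers, none of which is the base.
    run = []
    for ins in block:
        ok = (isinstance(ins, Ldr)
              and ins.offset == WORD * len(run)
              and ins.dst != ins.base
              and (not run or (ins.base == run[0].base
                               and regnum(ins.dst) > regnum(run[-1].dst))))
        if not ok:
            break
        run.append(ins)
    if len(run) < 2:
        return list(block)  # nothing worth merging

    base = run[0].base
    pending = WORD * len(run)  # how far the base is ahead of its old value
    out = [Ldmia(base, tuple(i.dst for i in run))]

    for ins in block[len(run):]:
        if pending and isinstance(ins, Ldr) and ins.base == base:
            if ins.offset >= pending:
                # Fold the writeback into the immediate offset.
                ins = Ldr(ins.dst, base, ins.offset - pending)
            else:
                # Offset would go negative: reset the base before this use.
                out.append(SubImm(base, pending))
                pending = 0
        out.append(ins)
        if isinstance(ins, Ldr) and ins.dst == base:
            pending = 0  # base overwritten; the old value is gone
    if pending:
        # Assumption: the base may be live-out, so restore it conservatively.
        out.append(SubImm(base, pending))
    return out
```

For example, three loads from r0 at offsets 0, 4 and 8 followed by a load at offset 20 become one Ldmia plus a load at offset 8; a following load at offset 0 instead forces a SubImm on r0 before it.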
