Add a pass to convert aggregate loads/stores into scalar load stores

Thu Oct 30 22:11:29 PDT 2014

----- Original Message -----
> From: "Reid Kleckner" <rnk at google.com>
> To: "Hal J. Finkel" <hfinkel at anl.gov>
> Cc: "Philip Reames" <listmail at philipreames.com>, "Chandler Carruth" <chandlerc at google.com>, "llvm-commits"
> <llvm-commits at cs.uiuc.edu>
> Sent: Thursday, October 30, 2014 7:34:17 PM
> Subject: Re: Add a pass to convert aggregate loads/stores into scalar load stores
> 
> You can also model the clobbering of padding in aggregate stores with
> store of undef.

I don't think that introducing stores is a good idea.

> 
> I don't like the memcpy approach because we introduce a new memory
> temporary that wasn't present in the input LLVM IR, and then rely on
> SROA to zap it for us later. I think we should skip that step and
> transform aggregate-load+extractvalue directly into scalar loads.

SROA may or may not zap it later, but SROA will also tend to transform things into wide loads, not individual ones.

The underlying issue, as Chandler has pointed out several times in various contexts, is that once you break a load up into smaller loads, you often lose the knowledge that the entire range of bytes is accessible. This knowledge is important when we get to CodeGen because it allows us to use wider loads which are often more efficient. The canonical form should not lose this information, and so we don't want to split up the load as the canonical choice.

Introducing an extra temporary does not particularly bother me, but the key thing is to stay as close as possible to the idiom that Clang is using because that is the idiom that I know we need to focus on optimizing well. Aligning our canonical form with the IR that Clang produces seems like the right thing to do in this case... it will maximally allow users of FCAs to benefit from the optimization philosophy employed by the rest of the system.

Thanks again,
Hal

> 
> 
> On Thu, Oct 30, 2014 at 4:53 PM, Finkel, Hal J. < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> 
> 
> Good point. You can add struct tbaa to the memcpy to preserve
> padding.
> 
> 
> -Hal
> 
> 
> Sent from my Verizon Wireless 4G LTE DROID
> 
> 
> 
> Reid Kleckner < rnk at google.com > wrote:
> 
> 
> 
> 
> 
> 
> 
> On Thu, Oct 30, 2014 at 2:08 PM, Philip Reames <
> listmail at philipreames.com > wrote:
> 
> 
> 
> 
> 
> 
> 
> On 10/28/2014 09:27 AM, Chandler Carruth wrote:
> 
> 
> 
> 
> 
> 
> On Tue, Oct 28, 2014 at 6:55 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> Chandler,
> 
> Can you please look at this? How do you think we should canonicalize
> this (is this the right approach)?
> Oof... yea.
> 
> 
> I wish I were more confident of what the "right" answer is here any
> more. =[ At the dev conference, Hal and I talked about a couple of
> options. I'm sure which is the "right" one, so let me lay out what I
> remember of that discussion.
> 
> Option 1 - Leave the load alone, improve GVN
> - not necessarily a bad option, but slightly risky in that it
> involves changes to key infrastructure with little in tree
> motivation
> - the original change doing this was rejected
> 
> Option 2 - Transform to load of component element types
> - tricky to get layout exactly right, but definitely doable
> 
> Option 3 - Transform to load of iN where N is sizeof(agg)*8.
> 
> Option 4 - Transform to series of smaller integer loads
> - This appears to be what this patch implements. Not entirely sure
> why this was chosen.
> 
> Option 5 - Transform to alloca and memcpy
> - Clang appears to emit loads of structs via an alloca (for the
> local) and a memcpy. The optimizer deconstructs this where
> appropriate.
> - I have no idea why Clang chose this option. My best guess is to
> exploit information about POD types?
> 
> Personally, I'd lean towards 5,1,2 above (in roughly that order). 1 &
> 2 seem like better long term solutions, but 5 probably works fairly
> well today. I'm not really a fan of either 3 or 4 since we loose
> things like distinctions between pointers and integers. If we had to
> choose, I'd take 3 over 4.
> 
> I think we also discussed the trade off between a pass and an
> instcombine transform. I would lean towards making whichever option
> we chose a canonicalization rule in instcombine.
> 
> Also, this patch implements option 2 for a struct with a single
> element type which seems like a (independently) useful
> canonicalization. Should we introduce that transform as a
> canonicalization in instcombine? I'd lean towards that.
> 
> Chandler, Hal - Thoughts, opinions?
> 
> 
> 
> 2 seems like the best option to me. When you load and store an FCA,
> you don't get to copy the padding of the struct with you. Once it's
> loaded, each element is its own value. There's no way to recover the
> padding. memcpy doesn't represent this.
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory