[LLVMdev] RFC: Missing canonicalization in LLVM

Thu Apr 30 12:35:17 PDT 2015

After trying various means of undoing the canonicalization at the SROA pass, I’m thinking an easier/better approach is to simply be much more conservative when doing the original canonicalization in the first place. My proposal is

if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {

          auto *SI = dyn_cast<StoreInst>(U);

          return SI && SI->getPointerOperand() != &LI &&

                 SI->getPointerOperand()->hasOneUse();

        })) {

where the addition of checking to ensure the StoreInst’s pointer operand is only used once. What I’ve found is that when it is used multiple times, it is usually due to the demotion of phis into loads/stores. There are multiple of them because several blocks are using the storage location essentially as a phi. I originally tried to ignore the cases where the loads only stored during SROA. But that really doesn’t work unless you follow all the other places that also load from those same locations, accounting for all the inserted bitcasts. It got way too messy trying to catch all the cases, and it still wasn’t able to undo what needed to be undone. 

I’ve found I can get back most all the regression loss I see by being more conservative at the canonicalization step.

Would this still meet the cases you originally wrote this canonicalization for Chandler?

Daniel

From: Chandler Carruth [mailto:chandlerc at google.com] 
Sent: Thursday, April 23, 2015 1:42 PM
To: Daniel Stewart; Pete Cooper
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

FYI, on vacation and then at a conference but will actually look at this on Monday next week.

On Thu, Apr 23, 2015, 18:31 Daniel Stewart <stewartd at codeaurora.org <mailto:stewartd at codeaurora.org> > wrote:

I think I can solve this by both adding BitCast checks to the loads, in addition to the Stores, and also checking the Stores to ensure they are fed by a Load and that the Load only feeds it. I’ll test this solution some more and try to make a patch. 

Daniel

From:  <mailto:llvmdev-bounces at cs.uiuc.edu> llvmdev-bounces at cs.uiuc.edu [mailto: <mailto:llvmdev-bounces at cs.uiuc.edu> llvmdev-bounces at cs.uiuc.edu] On Behalf Of Daniel Stewart
Sent: Thursday, April 23, 2015 9:17 AM

To: 'Pete Cooper'
Cc: 'LLVM Developers Mailing List'
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

Thanks for the reply Pete. Unfortunately, I don’t think it is going to be as simple as ignoring those loads which only store. In findCommonType(), only one alloca is passed in at a time. So, while you could find those cases where that alloca was loaded from and stored elsewhere, you can’t find those places that store to that alloca from somewhere else (at least not easily that I can see). 

So in my particular example, I could catch the case of only load -> store. However, there are other stores that use the alloca address, but it is not readily apparent if they come directly & only from a load. 

Still trying to figure out the best way forward.

Daniel

From: Pete Cooper [ <mailto:peter_cooper at apple.com> mailto:peter_cooper at apple.com] 
Sent: Tuesday, April 21, 2015 2:00 PM
To: Daniel Stewart
Cc: LLVM Developers Mailing List; Chandler Carruth
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

Hi Daniel

Thanks for the excellent breakdown of whats going on here.

Earlier in the thread on this I made this comment:

"The first thing that springs to mind is that I don’t trust the backend to get this right.  I don’t think it will understand when an i32 load/store would have been preferable to a float one or vice versa.  I have no evidence of this, but given how strongly typed tablegen is, I don’t think it can make a good choice here.

So I think we probably need to teach the backend how to undo whatever canonical form we choose if it has a reason to”

Without seeing the machine instructions, its hard to be 100% certain, but the case you’ve found may be simple enough that the backend can actually fix this. However, such a fixup would be quite target specific (such a target would need different register classes for integers and doubles in this case), and we’d need such a pass for all targets which isn’t ideal.

So i wouldn’t rule out a backend solution, but i have a preference for your suggestion to improve SROA.

In this particular case, it makes sense for SROA to do effectively the same analysis InstCombine did here and work out when a load is just raw data vs when its data is used as a specific type.  The relevant piece of InstCombine is this:

// Try to canonicalize loads which are only ever stored to operate over

  // integers instead of any other type. We only do this when the loaded type

  // is sized and has a size exactly the same as its store size and the store

  // size is a legal integer type.

  if (!Ty->isIntegerTy() && Ty->isSized() &&

      DL.isLegalInteger(DL.getTypeStoreSizeInBits(Ty)) &&

      DL.getTypeStoreSizeInBits(Ty) == DL.getTypeSizeInBits(Ty)) {

    if (std::all_of(LI.user_begin(), LI.user_end(), [&LI](User *U) {

          auto *SI = dyn_cast<StoreInst>(U);

          return SI && SI->getPointerOperand() != &LI;

        })) {

...

After ignoring load/stores which satisfy something like the above code, you can always fallback to the current code of choosing an integer type, so in the common case there won’t be any behavior difference.

Cheers

Pete

On Apr 21, 2015, at 9:18 AM, Daniel Stewart <stewartd at codeaurora.org <mailto:stewartd at codeaurora.org> > wrote:

So this change did indeed have an effect! :)

I’m seeing regressions in a number of benchmarks mainly due to a host of extra bitcasts that get introduced. Here’s the problem I’m seeing in a nutshell:

1)      There is a Phi with input type double

2)      Polly demotes the phi into a load/store of type double

3)      InstCombine canonicalizes the load/store to use i64 instead of double

4)      SROA removes the load/store & inserts a phi back in, using i64 as the type. Inserts bitcast to get to double.

5)      The bitcast sticks around and eventually get translated into FMOVs (for AArch64 at least).

The function findCommonType() in SROA.cpp is used to obtain the type that should be used for the new alloca that SROA wants to create. It’s decision process is essentially – if all loads/stores of alloca are the same, use that type; else use the corresponding integer type. This causes bitcasts to be inserted in a number of places, most all of which stick around. 

I’ve copied a reduced version of an instance of the problem below. I’m looking for comments on what others think is the right solution here. Make SROA more intelligent about picking the type? 

The code is below with all unnecessary code removed for easy consumption. 

Daniel

Before Polly – Prepare code for polly we have code that looks like:

while.cond473:                                    ; preds = %while.cond473.outer78, %while.body475

  %p_j_x452.0 = phi double [ %105, %while.body475 ], [ %p_j_x452.0.ph82, %while.cond473.outer78 ]

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0

  %105 = load double* %x485, align 8, !tbaa !25

After Polly – Prepare code for polly we have:

while.cond473:                                    ; preds = %while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %64, %p_j_x452.0.reload

  %110 = load double* %x485, align 8, !tbaa !25

  store double %110, double* %p_j_x452.0.reg2mem

After Combine redundant instructions :

while.cond473:                                    ; preds = %while.cond473.outer78, %while.body475

  %p_j_x452.0.reload = load double* %p_j_x452.0.reg2mem, align 8

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %74, %p_j_x452.0.reload

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0

  %194 = bitcast double* %x485 to i64*

  %195 = load i64* %194, align 8, !tbaa !25

  %200 = bitcast double* %p_j_x452.0.reg2mem to i64*

  store i64 %195, i64* %200, align 8

After SROA :

while.cond473:                                    ; preds = %while.cond473.outer78, %while.body475

  %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 = phi i64 [ %p_j_x452.0.ph73.reg2mem.sroa.0.0.load368, %while.cond473.outer78 ], [ %178, %while.body475 ]

  %173 = bitcast i64 %p_j_x452.0.reg2mem.sroa.0.0.p_j_x452.0.reload362 to double

while.body475:                                    ; preds = %while.cond473

  %sub480 = fsub fast double %78, %173

  %x485 = getelementptr inbounds %struct.CompAtom* %15, i64 %idxprom482, i32 0, i32 0

  %177 = bitcast double* %x485 to i64*

  %178 = load i64* %177, align 8, !tbaa !25

From:  <mailto:llvmdev-bounces at cs.uiuc.edu> llvmdev-bounces at cs.uiuc.edu [ <mailto:llvmdev-bounces at cs.uiuc.edu> mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth
Sent: Wednesday, January 21, 2015 8:32 PM
To: Pete Cooper
Cc: LLVM Developers Mailing List
Subject: Re: [LLVMdev] RFC: Missing canonicalization in LLVM

On Wed, Jan 21, 2015 at 3:06 PM, Pete Cooper < <mailto:peter_cooper at apple.com> peter_cooper at apple.com> wrote:

Sounds good to me.  Integers it is then.

FYI, thanks, I'm just going to commit this then. It seems we're all in essential agreement. We can revert it and take a more cautious approach if something terrible happens. =]

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>          http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150430/5eb7fcca/attachment.html>