[PATCH] D94308: [MachineSink] SinkIntoLoop: analyse stores and aliases in between

Tue Feb 2 03:14:06 PST 2021

SjoerdMeijer added a comment.

Thanks for reviewing!
Find replies inlined to your comments/questions. I am now addressing the issue that all blocks in the loop should be analysed as remarked previously.

================
Comment at: llvm/test/CodeGen/AArch64/loop-sink.mir:969-970
   ; CHECK:   [[PHI2:%[0-9]+]]:gpr32 = PHI [[COPY]], %bb.1, %4, %bb.3
+  ; CHECK:   [[ADRP:%[0-9]+]]:gpr64common = ADRP target-flags(aarch64-page) @A
+  ; CHECK:   [[LDRWui:%[0-9]+]]:gpr32 = LDRWui killed [[ADRP]], target-flags(aarch64-pageoff, aarch64-nc) @A :: (dereferenceable load 4 from `i32* getelementptr inbounds ([100 x i32], [100 x i32]* @A, i64 0, i64 0)`)
   ; CHECK:   [[SDIVWr:%[0-9]+]]:gpr32 = SDIVWr [[PHI2]], [[LDRWui]]
----------------
dmgreen wrote:
> SjoerdMeijer wrote:
> > dmgreen wrote:
> > > Some of these look very expensive to avoid a stack reload. Is the reload not cheaper, even if it is in the loop?
> > Yep, that might very well be the case. But this is about building up the framework to support loop sinking which requires quite a bit of surgery: after this analysis/correctness patch, we indeed need to work on the cost-model/profitability (your remark), we may need options to force or limit the number of sinks, and then we need to enable this by default. Thus, quite a few steps to go, and cost analysis will be next after this patch.
> Do you have the cost part yet? It's difficult to see if this will be a good thing to do without the complete picture. I guess having the code to do it shouldn't hurt in any case.
> 
> Naively it would seem to trade one stack reload in the loop for a sunk instruction and the possibly the stack reload of it's arguments. And this appears to be AArch64 - a system with a lot of registers and comparatively cheap stack reloads. On v6m it would be much more likely to come up with the limited amount of registers available. Without being careful it has a high chance of making things worse, not better!
> It's difficult to see if this will be a good thing to do without the complete picture. 

I think the complete picture does not exist. Or, there are many complete pictures, thus I don't think I can meaningfully sketch this or claim to have provided this. The general picture is that we have aggressive hoisting as a canonicalisation transformation, but that's the only mode that we have, and there's nothing in between, please see also my commit message rG48ecba350ed6 of the groundwork of this. This has links to discussions on the llvm dev mailing list, where this has been discussed several times and highlighted as a generic LLVM issue. This is adding the framework to at least be able to support this.

> Do you have the cost part yet?

Nope, not yet. My motivating example is in D92488, which has actually been massively reduced. I think there's low hanging fruit here where we can easily do better. 

> Naively it would seem to trade one stack reload in the loop for a sunk instruction and the possibly the stack reload of it's arguments. And this appears to be AArch64 - a system with a lot of registers and comparatively cheap stack reloads. On v6m it would be much more likely to come up with the limited amount of registers available. Without being careful it has a high chance of making things worse, not better!

Yep, sure, this needs careful tuning, like most other things. This is what we should do next.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94308/new/

https://reviews.llvm.org/D94308