[PATCH] D124984: [LoopCacheAnalysis] Use smaller numbers when calculating the costs (PR55233)

Mon May 16 13:08:38 PDT 2022

congzhe added a comment.

In D124984#3516334 <https://reviews.llvm.org/D124984#3516334>, @bmahjour wrote:

> Although the problem reported in PR55233 is worsened by D123400 <https://reviews.llvm.org/D123400>, the underlying issue existed even before that (as illustrated by the example in PR55233), so this patch won't completely solve the overflow problem, although it would make it a bit less likely to occur. I think a proper solution to PR55233 would need to address the overflow problem itself (eg by increasing the range of values that the cost model can represent).
>
> I had thought about this approach when working on D123400 <https://reviews.llvm.org/D123400>, and while I agree it will make the cost values smaller, I have some reservations about its accuracy.
>
> One observation is that with this patch, the cost values for non-consecutive accesses in deep nests become smaller, but also their relative differences reduce as well. For example in `single-store.ll`, the cost difference between for.k and for.j is in the orders of magnitude, while the cost difference between for.j and for.i is much smaller. I worry that this dilution of cost difference might make it harder to find the right order in loop nests that contain multiple reference groups. I think this approach slightly diverges from the concepts presented in the paper, inasmuch as, the unit of measurement for the cost value is meant to be number of cache lines, but with the multiplication by the "depth" factor the value will no longer be estimating number of cache lines.

Thanks Bardia for the comment! Regarding the test case in `loads-store.ll`: in D123400 <https://reviews.llvm.org/D123400> it should actually have been changed to "; CHECK: Loop 'for.i' has cost = 300000000" since the cost becomes 300000000 instead of 3000000. Currently the test still passes just because it happens to also match "Loop 'for.i' has cost = 3000000". After this patch the cost reduces from 300000000 to 6000000, which is consistent with the bahavior in other tests.

I agree that our long-term goal is to address the overflow problem itself possibly by increasing the range of values that the cost model can represent, and this patch might be a short-term solution. IMHO I think however we want to solve the problem, the numbers calculated by loop cache analysis will always be an estimation of some sort. We try to preserve its accuracy but might need to do some adjusments due to various reasons. For example, in `loads-store.ll`, the cost of "300000000" for loop "for.i" that we currently get is already not an accurate estimation of the accessed cache line numbers -- "3000000" might be a closer estimate but since we want to distinguish between outer loops, we adjusted "3000000" to "300000000.

I was hoping what this patch does aligns with the reasoning above. Nevertheless, I'm open to other solutions as well. I wonder if you feel like we should continue this patch or we should seek other solutions?

================
Comment at: llvm/lib/Analysis/LoopCacheAnalysis.cpp:332
+    const SCEV *DepthFactor =
+        SE.getConstant(WiderType, getNumSubscripts() - Index - 1);
+    RefCost = SE.getMulExpr(SE.getNoopOrAnyExtend(RefCost, WiderType),
----------------
bmahjour wrote:
> This will always give a factor of 0 for the inner-most subscript. Is that intentional?
Thanks, will update it in the next version.

================
Comment at: llvm/lib/Analysis/LoopCacheAnalysis.cpp:332
+    const SCEV *DepthFactor =
+        SE.getConstant(WiderType, getNumSubscripts() - Index - 1);
+    RefCost = SE.getMulExpr(SE.getNoopOrAnyExtend(RefCost, WiderType),
----------------
congzhe wrote:
> bmahjour wrote:
> > This will always give a factor of 0 for the inner-most subscript. Is that intentional?
> Thanks, will update it in the next version.
For innermost subscript usually the access is consecutive hence this "else" block won't be reached. If the access is not consecutive, the factor being 0 will still make the cost of innermost subscript to be smallest, which kind of works although not perfect.

An easy improvement/fix is to use something like `(getNumSubscripts() - Index - 1)==0?1:(getNumSubscripts() - Index - 1)`. If you think this is better I can upate it in the next version.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124984/new/

https://reviews.llvm.org/D124984