[PATCH] D83626: [AMDGPU/MemOpsCluster] Guard new mem ops clustering heuristic logic by a flag
Mahesha S via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Jul 13 13:01:55 PDT 2020
hsmhsm marked an inline comment as done.
hsmhsm added inline comments.
================
Comment at: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:562
+ }
+ return NumLoads <= MaxNumLoads;
}
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > hsmhsm wrote:
> > > > hsmhsm wrote:
> > > > > rampitec wrote:
> > > > > > You have a problem with extremely wide loads. I am not sure what was in the regression case, but probably something like 8 longs or so. Isn't it better to tweak it instead and just clamp based on NumBytes, as it is supposed to be? You say you are checking NumBytes, but the return is solely based on NumLoads.
> > > > > Let's consider the example from your reply to my email:
> > > > >
> > > > > Stas' Reply:
> > > > >
> > > > >
> > > > > ```
> > > > > Now say you have 4 incoming loads of <8 x i64>. According to your logic: this is 8*64*4/8 = 256 bytes, MaxNumLoads = 4. You would say it is OK to cluster, because 4 <= 4 == true.
> > > > > Now say you have 8 loads of <8 x i32>. It is the same 256 bytes, but you inhibit clustering because 8 <= 4 == false.
> > > > >
> > > > > It is not really based on the number of bytes, it is based on the number of loads.
> > > > > ```
> > > > >
> > > > >
> > > > >
> > > > > As per my understanding, the crux of the logic is this: even when the clustered byte count is large, if it is shared by only a small number of load instructions, those loads are still considered for clustering. That is where the clustered byte count actually plays a role.
> > > > >
> > > > > In that sense, your first example is fine for the heuristic, but the second one is not.
> > > > And, in any case, I am not sure what you mean by "just clamp based on NumBytes, as it is supposed to be". I would welcome a bit more detail on your idea so that I can understand it fully and experiment with it quickly.
> > > >
> > > > Also, reworking the heuristic probably means re-running all the must-run performance benchmarks and making sure we do not regress there, which means the patch will take its own time before it is committed to master.
> > > I mean your logic can allow extra-wide loads, which is not desired. I suggest limiting it. Did you look at the regression IR?
> > >
> > > Then, if you are afraid of further regressions and think more testing is needed, just revert the change that led to the regression.
> > My honest opinion is this: my hands are tied with other urgent issues right now, and I do not have any time left to think about it. So the safe bet for now is to just continue with the old logic as it is. Since we are not permanently reverting the change, only disabling it, we can come back to it once I get some breathing room and rework it sooner rather than later.
> It simply does not do what you wrote it does. You wrote that it makes a decision based on the number of bytes. It does not.
I had to come up with this heuristic in order to satisfy all the workloads and, more importantly, to make sure that the workload (rocSPARSE) for which the work was primarily undertaken behaves as expected.
But yes, I think I am missing some key insights here.
Any change now means re-evaluating everything from scratch. It is also not going to be a very straightforward change, considering all the workloads.
So, let's disable it for the time being.
If you are otherwise fine with this patch, let's merge it first so that it won't block some key projects.
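For reference, here is a small standalone sketch of the objection (not the actual SIInstrInfo.cpp code; the `wouldCluster` helper and the MaxNumLoads value of 4 for 256 clustered bytes are assumptions taken from the example quoted above):

```
#include <cstdio>

// Hypothetical stand-in for the final check under discussion; a sketch only,
// not the real heuristic from the patch.
static bool wouldCluster(unsigned NumLoads, unsigned MaxNumLoads) {
  return NumLoads <= MaxNumLoads;
}

int main() {
  const unsigned MaxNumLoads = 4; // assumed limit for 256 clustered bytes

  // 4 loads of <8 x i64>: 4 * 64 bytes = 256 bytes total -> clustered.
  std::printf("4 x <8 x i64>: %s\n",
              wouldCluster(4, MaxNumLoads) ? "cluster" : "no cluster");

  // 8 loads of <8 x i32>: 8 * 32 bytes = 256 bytes total -> not clustered,
  // even though the total byte count is identical.
  std::printf("8 x <8 x i32>: %s\n",
              wouldCluster(8, MaxNumLoads) ? "cluster" : "no cluster");
  return 0;
}
```

Both cases move the same 256 bytes, yet only the load count decides the outcome, which is exactly the concern raised above.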
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D83626/new/
https://reviews.llvm.org/D83626