[PATCH] D37698: Add target callback doMemOpsHaveSameBasePtr

Wed Sep 13 02:05:30 PDT 2017

hfinkel added inline comments.

================
Comment at: lib/Target/AMDGPU/SIInstrInfo.cpp:360

+bool SIInstrInfo::doMemOpsHaveSameBasePtr(const MachineInstr &MI1,
+                                          unsigned BaseReg1,
----------------
rampitec wrote:
> rampitec wrote:
> > hfinkel wrote:
> > > rampitec wrote:
> > > > hfinkel wrote:
> > > > > Can you make *this* the default implementation?
> > > > This will result in both improvements an regressions across the board as it will increase amount of clustering dramatically. Having no access to all HW and benchmark suites and targets I left the default behavior intact. In particular I had to limit clustering in amdgpu below to maintain reasonable average performance, which I cannot do for all other targets.
> > > > 
> > > > I.e. I would be more than happy to  make it default behavior, but I think much more people should agree to spend their time tuning it after for their targets.
> > > > 
> > > > What we can do, if this is eventually clonned to the majority of targets requiring clustering, then change the default.
> > > > 
> > > > What others think? Is this a generally desired default behaviour?
> > > If we generally adopted your current methodology, then we'd very rarely get any improvements to the common components. We need to do things that generally make sense. People are encouraged to benchmark them on a variety of targets, and we'll often give them a heads up (on llvm-dev and/or pinging target code owners), but otherwise, they'll notice post-commit if things regress. That having been said...
> > > 
> > > As currently defined by this patch, this implementation is how I'd recommend implementing the target-independent code for the callback. The question is then: is this the right callback to add? Is it too general? Do you think that, some tuning aside, this is the right thing to do on most/all relevant targets? Do we actually want this callback or do we want just a Boolean switch (this behavior vs. the old behavior)? [I can certainly imagine hardware store combining where the base register actually matters, and I can also imagine hardware where the address is all that matters, but I'd assume most modern hardware wants the latter.]
> > > 
> > In general there are several questionable things about it if we generalize it for all targets:
> > 
> > 1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
> > 
> > 2. It is not a fastest code out there as it uses GetUnderlyingObject.
> > 
> > 3. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
> > 
> > 4. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.
> > 
> > All of that does not prohibit it to be a default implementation though if can be overwritten.
> D37755 moves AMDGPU implementation into core.
To comment on the conditions:

> In general there are several questionable things about it if we generalize it for all targets:

> 1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.

This is fair, although the current clustering logic does not consider this.

> 2. It is not a fastest code out there as it uses GetUnderlyingObject.

We already essentially call this once per MMO-carrying instruction in ScheduleDAGInstrs.cpp. Doing it again here does not seem so bad. Unless we're now doing it O(N^2) times. Are we?

> 3. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
> 4. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.

Targets should maintain reasonable MMOs, and as far as I know, all (in tree) targets now do.

================
Comment at: lib/Target/AMDGPU/SIInstrInfo.cpp:360

+bool SIInstrInfo::doMemOpsHaveSameBasePtr(const MachineInstr &MI1,
+                                          unsigned BaseReg1,
----------------
hfinkel wrote:
> rampitec wrote:
> > rampitec wrote:
> > > hfinkel wrote:
> > > > rampitec wrote:
> > > > > hfinkel wrote:
> > > > > > Can you make *this* the default implementation?
> > > > > This will result in both improvements an regressions across the board as it will increase amount of clustering dramatically. Having no access to all HW and benchmark suites and targets I left the default behavior intact. In particular I had to limit clustering in amdgpu below to maintain reasonable average performance, which I cannot do for all other targets.
> > > > > 
> > > > > I.e. I would be more than happy to  make it default behavior, but I think much more people should agree to spend their time tuning it after for their targets.
> > > > > 
> > > > > What we can do, if this is eventually clonned to the majority of targets requiring clustering, then change the default.
> > > > > 
> > > > > What others think? Is this a generally desired default behaviour?
> > > > If we generally adopted your current methodology, then we'd very rarely get any improvements to the common components. We need to do things that generally make sense. People are encouraged to benchmark them on a variety of targets, and we'll often give them a heads up (on llvm-dev and/or pinging target code owners), but otherwise, they'll notice post-commit if things regress. That having been said...
> > > > 
> > > > As currently defined by this patch, this implementation is how I'd recommend implementing the target-independent code for the callback. The question is then: is this the right callback to add? Is it too general? Do you think that, some tuning aside, this is the right thing to do on most/all relevant targets? Do we actually want this callback or do we want just a Boolean switch (this behavior vs. the old behavior)? [I can certainly imagine hardware store combining where the base register actually matters, and I can also imagine hardware where the address is all that matters, but I'd assume most modern hardware wants the latter.]
> > > > 
> > > In general there are several questionable things about it if we generalize it for all targets:
> > > 
> > > 1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
> > > 
> > > 2. It is not a fastest code out there as it uses GetUnderlyingObject.
> > > 
> > > 3. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
> > > 
> > > 4. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.
> > > 
> > > All of that does not prohibit it to be a default implementation though if can be overwritten.
> > D37755 moves AMDGPU implementation into core.
> To comment on the conditions:
> 
> > In general there are several questionable things about it if we generalize it for all targets:
> 
> > 1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
> 
> This is fair, although the current clustering logic does not consider this.
> 
> > 2. It is not a fastest code out there as it uses GetUnderlyingObject.
> 
> We already essentially call this once per MMO-carrying instruction in ScheduleDAGInstrs.cpp. Doing it again here does not seem so bad. Unless we're now doing it O(N^2) times. Are we?
> 
> > 3. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
> > 4. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.
> 
> Targets should maintain reasonable MMOs, and as far as I know, all (in tree) targets now do.
> 
> D37755 moves AMDGPU implementation into core.

Thanks for posting the follow-up patch, but I don't think that we should do that. Looking at how AArch64 uses the existing implementation, I do think that we should rename the callback to be something specific to store clustering.

As far as I can tell from the discussion in D18376, the store clustering, as used by AArch64, is specifically setup to better enable the AArch64LoadStoreOptimizer, and that pass is specifically looking for stores that share the same base register (within some small window, which is why this helps).

As a result, I think it would be good to make the callback specific to store clustering.

https://reviews.llvm.org/D37698