[PATCH] D37698: Add target callback doMemOpsHaveSameBasePtr
    Stanislav Mekhanoshin via Phabricator via llvm-commits 
    llvm-commits at lists.llvm.org
       
    Tue Sep 12 12:12:26 PDT 2017
    
    
  
rampitec marked 4 inline comments as done.
rampitec added inline comments.
================
Comment at: lib/Target/AMDGPU/SIInstrInfo.cpp:360
 
+bool SIInstrInfo::doMemOpsHaveSameBasePtr(const MachineInstr &MI1,
+                                          unsigned BaseReg1,
----------------
rampitec wrote:
> hfinkel wrote:
> > rampitec wrote:
> > > hfinkel wrote:
> > > > Can you make *this* the default implementation?
> > > This will result in both improvements an regressions across the board as it will increase amount of clustering dramatically. Having no access to all HW and benchmark suites and targets I left the default behavior intact. In particular I had to limit clustering in amdgpu below to maintain reasonable average performance, which I cannot do for all other targets.
> > > 
> > > I.e. I would be more than happy to  make it default behavior, but I think much more people should agree to spend their time tuning it after for their targets.
> > > 
> > > What we can do, if this is eventually clonned to the majority of targets requiring clustering, then change the default.
> > > 
> > > What others think? Is this a generally desired default behaviour?
> > If we generally adopted your current methodology, then we'd very rarely get any improvements to the common components. We need to do things that generally make sense. People are encouraged to benchmark them on a variety of targets, and we'll often give them a heads up (on llvm-dev and/or pinging target code owners), but otherwise, they'll notice post-commit if things regress. That having been said...
> > 
> > As currently defined by this patch, this implementation is how I'd recommend implementing the target-independent code for the callback. The question is then: is this the right callback to add? Is it too general? Do you think that, some tuning aside, this is the right thing to do on most/all relevant targets? Do we actually want this callback or do we want just a Boolean switch (this behavior vs. the old behavior)? [I can certainly imagine hardware store combining where the base register actually matters, and I can also imagine hardware where the address is all that matters, but I'd assume most modern hardware wants the latter.]
> > 
> In general there are several questionable things about it if we generalize it for all targets:
> 
> 1. It does not respect offset distance, which may be important. I.e. some targets may only want it if they hit the same cacheline. It does not apply to amdgpu because we always want to form so called "memory clause", a batch of memory instructions.
> 
> 2. It is not a fastest code out there as it uses GetUnderlyingObject.
> 
> 3. Some targets may still want to implement it for the sake of further combining loads, but relying on memory operands that may not be generally possible to do.
> 
> 4. That is unclear if all targets maintain reasonable memory operands for this to be useful, however it still compares base registers as its first resort.
> 
> All of that does not prohibit it to be a default implementation though if can be overwritten.
D37755 moves AMDGPU implementation into core.
https://reviews.llvm.org/D37698
    
    
More information about the llvm-commits
mailing list