[PATCH] D146425: [SystemZ] Enable AtomicExpandPass for i128

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 3 05:27:38 PDT 2023


jonpa added a comment.

In D146425#4237014 <https://reviews.llvm.org/D146425#4237014>, @uweigand wrote:

>> In theory, of course it would have been very nice if the register allocator would not run out of registers (still does without this), and also if it could split ("uncoalesce") as needed to minimize spilling. It does unfortunately not, so there remains the option to try to avoid spilling in shouldCoalesce(). Should we keep it simple here, or should we experiment a little and try to at least not increase spilling compared to main? Not sure if there is a natural heuristic that includes the CDSG loop without explicitly checking for that case...
>
> While the raw spill counts give some indication, of course, it would be good to also look at actual performance impact of the change.   A few inline comments about the heuristics ...
>
> Also, shouldn't there be some generic heuristics of whether or not coalescing is worthwhile?   It seems to me this would always be a tradeoff for any register class, not just for the 128-bit pairs?



> Does it help tweaking this heuristic a bit? What if we use 4 or 2 instead of 3?
> This introduces yet another weird heuristics. Is this even necessary at all? What are the compile-time impacts of just not doing this check?

I looked into the compile time with various settings and found that what we do in shouldCoalesce() does not really seem to make much difference:

                                          Average Wall:    3 worst:
  main                                    1.75%            38.3%, 20.4%, 19.3%
  main, no SystemZ shouldCoalesce()       1.75%            38.0%, 20.2%, 19.4%
  Patch (3 / 50)                          1.76%            38.6%, 20.6%, 19.4%
  Patch (3 / unlim)                       1.76%            38.2%, 20.3%, 19.5%
  Patch (2 / 50)                          1.75%            38.8%, 20.4%, 19.3%
  Patch (6 / unlim)                       1.76%            38.3%, 20.3%, 19.4%
  Patch (7 / unlim)                       1.76%            38.1%, 20.4%, 19.5%

So as far as compile time is concerned, it seems we could skip the check.
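
For reference, here is a rough sketch of the kind of check these settings parameterize, as I read them: the first number bounds how many GR128-clobbering instructions are tolerated in the scanned region, and the second is the instruction search limit ("unlim" meaning no limit). The function name and exact structure below are illustrative only, not the actual code in SystemZRegisterInfo::shouldCoalesce():

  // Illustrative sketch only -- not the patch itself (would conceptually live
  // in SystemZRegisterInfo.cpp).  Scan the COPY's block and count instructions
  // that clobber a GR128 register (e.g. CDSG); give up if the region exceeds
  // SearchLimit instructions.
  static bool mayCoalesceGR128(const MachineInstr &CopyMI,
                               const TargetRegisterInfo &TRI,
                               unsigned ClobberLimit,  // e.g. 3
                               unsigned SearchLimit) { // e.g. 50, UINT_MAX for "unlim"
    const MachineBasicBlock &MBB = *CopyMI.getParent();
    unsigned Searched = 0, GR128Clobbers = 0;
    for (const MachineInstr &MI : MBB) {
      if (++Searched > SearchLimit)
        return false;                     // Region too large to analyze.
      for (MCPhysReg Reg : SystemZ::GR128BitRegClass)
        if (MI.modifiesRegister(Reg, &TRI)) {
          ++GR128Clobbers;
          break;
        }
    }
    return GR128Clobbers <= ClobberLimit;
  }

With that reading, the "3 / 50" row corresponds to ClobberLimit=3 and SearchLimit=50, and the "unlim" rows to an effectively unlimited SearchLimit.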

However, the setting also affects the number of coalescings performed, and thereby the spilling:

  main <> "3 / 50"
  Spill|Reload   :               635680               636698    +1018
  Copies         :              1010870              1010271     -599
  
  main <> "2 / 50":  Identical to "3 / 50"
  main <> "4 / 50":  Identical to "3 / 50"
  main <> "5 / 50":  Very little difference to "3 / 50": -41 Spill|Reload / +18 Copies.
  
  main <> "3 / unlim"
  Spill|Reload   :               635680               642379    +6699
  Copies         :              1010870              1008390    -2480
  
  main <> "6 / unlim"
  Spill|Reload   :               635680               640658    +4978
  Copies         :              1010870              1009061    -1809
  
  main <> "7 / unlim"
  Spill|Reload   :               635680               639023    +3343
  Copies         :              1010870              1009719    -1151

It seems like the unlimited search allows for more coalescing but also results in more spill/reload instructions. However, looking at the nightly benchmarking, I see:

  Overall results (by average over benchmarks):
  Build:                                                                    Average result
  2017_D_6_unlim                                                            99.787 %
  2017_E_7_unlim                                                            99.892 %
  2017_B_3_50                                                               99.953 %
  2017_C_3_unlim                                                            100.027 %
  
  Improvements 2017_D_6_unlim:
  0.976: i523.xalancbmk_r 
  0.981: i502.gcc_r 
  0.986: f526.blender_r 
  
  Regressions 2017_D_6_unlim:
  1.011: i505.mcf_r 

The "3 / 50" is "ok, but there may be some improvement actually by doing something like just having a check against 1 or 2 GR128 clobbers (and no search limit), like "D". This is still preliminary though ("mini").

I did not find anything generic along the lines of your suggestion, but looking at Hexagon I saw that they are doing something similar, which I tried:

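    // Allow coalescing only if neither live interval crosses a call: a call
    // clobbers most GPRs, so a wide register live across it would likely have
    // to be spilled.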
    const SlotIndexes &Indexes = *LIS.getSlotIndexes();
    auto HasCall = [&Indexes] (const LiveInterval::Segment &S) {
      for (SlotIndex I = S.start.getBaseIndex(), E = S.end.getBaseIndex();
           I != E; I = I.getNextIndex()) {
        if (const MachineInstr *MI = Indexes.getInstructionFromIndex(I))
          if (MI->isCall())
            return true;
      }
      return false;
    };
  
    LiveInterval &DstInterval = LIS.getInterval(MI->getOperand(0).getReg());
    LiveInterval &SrcInterval = LIS.getInterval(MI->getOperand(1).getReg());
    return !any_of(DstInterval, HasCall) && !any_of(SrcInterval, HasCall);
  }

First, just for the i128 case, this gave:

  Spill|Reload   :               635680               639026    +3346
  Copies         :              1010870              1009736    -1134

This then seems pretty close to "7 / unlim".

Unfortunately, however, this does not help the register allocator enough on its own: it runs out of registers on an inline-assembly test (CodeGen/SystemZ/regalloc-GR128.ll).

I also tried this on all register classes, thinking that even though more COPYs would then remain, perhaps the COPY hints in the register allocator would compensate. The answer is no:

  Spill|Reload   :               635680               640701    +5021
  Copies         :              1010870              1188763  +177893

Finally, I tried applying the current main heuristic generally, meaning only coalescing intervals that are local to a single MBB (which gives less spilling for GR128 compared to "3 / 50"):

  Spill|Reload   :               635680               638633    +2953
  Copies         :              1010870              1804531  +793661

As expected, this is not a good idea. So the general aspects of this beyond GR128 remain unclear...
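
For reference, the "local to one MBB" condition tried here can be expressed roughly as below. This is a minimal sketch using LiveIntervals::intervalIsInOneMBB(), with the same MI/LIS as in the Hexagon-style snippet above; it is not the exact code in main:

  // Minimal sketch: coalesce only if both live intervals are fully contained
  // in the COPY's own basic block (illustrative, not the exact main heuristic).
  MachineBasicBlock *MBB = MI->getParent();
  LiveInterval &DstLI = LIS.getInterval(MI->getOperand(0).getReg());
  LiveInterval &SrcLI = LIS.getInterval(MI->getOperand(1).getReg());
  return LIS.intervalIsInOneMBB(DstLI) == MBB &&
         LIS.intervalIsInOneMBB(SrcLI) == MBB;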


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146425/new/

https://reviews.llvm.org/D146425


