[PATCH] D146425: [SystemZ] Enable AtomicExpandPass for i128

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 4 02:54:55 PDT 2023


jonpa added a comment.

With full benchmark runs on a few tests I see:

  Improvements "3/50":
  0.994: i523.xalancbmk_r 
  
  Regressions "3/50":
  1.005: f526.blender_r 
  1.004: i502.gcc_r 
  1.003: i505.mcf_r 
  1.002: f507.cactuBSSN_r 



  Improvements "no preg clobbers / no search lim":
  1.000: i502.gcc_r 
  
  Regressions "no preg clobbers / no search lim":
  1.006: f526.blender_r 
  1.005: f507.cactuBSSN_r 
  1.005: i505.mcf_r 
  1.004: i523.xalancbmk_r 

These are small variations. Looking at the spill/COPY counts, the 'blender' result is a bit of a surprise, but generally increased spilling is of course not good:

  main <> "no preg clobbers / no search lim"
  
  f526.blender_r
  Spill|Reload   :                90817                90761      -56
  Copies         :               208906               208506     -400
  
  f507.cactuBSSN_r
  Spill|Reload   :               148023               148036      +13
  Copies         :                59083                59083       +0
  
  i505.mcf_r
  Spill|Reload   :                  702                  706       +4
  Copies         :                  562                  565       +3
  
  i523.xalancbmk_r
  Spill|Reload   :                14979                15029      +50
  Copies         :               134345               134185     -160

So it seems that we may get a few very slight regressions by rewriting shouldCoalesce(). Being careful with a search limit (at 50) helps a little, but not much. We could:

- keep it simple and accept the slight regressions above.
- keep the original version, with the special-case handling added for the CDSG loops, which have intervals spanning multiple blocks.
- constrain this new version further to get results closer to the unmodified version.

It depends a little on what behavior we think would be ideal. Normally coalescing is "good", but coalescing a GR128 live range with a subreg will make the whole live range 128 bits wide, which costs a GR64 register. On the other hand, that GR64 subreg usage typically extends the live interval only a little (at least per my old notes from 2017), so coalescing to avoid the register move may still be worth it.
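As an illustration (not taken from the patch), the typical situation is a compare-and-swap loop on an __int128, where the 128-bit value ends up in a GR128 register pair while the surrounding code mostly works on the 64-bit halves. The following hypothetical snippet is the kind of source that produces such a CDSG loop and the GR64 subreg COPYs in question:

  // Hypothetical example (not from the patch) of source code that lowers to
  // a CDSG loop on SystemZ. The 128-bit value lives in a GR128 register
  // pair, while the increment only involves a 64-bit operand, so GR64
  // subreg COPYs appear around the loop.
  #include <atomic>

  std::atomic<__int128> Val;

  __int128 fetchAdd(long Inc) {
    __int128 Old = Val.load(std::memory_order_relaxed);
    __int128 New;
    do {
      New = Old + Inc;                              // 64-bit operand, 128-bit result
    } while (!Val.compare_exchange_weak(Old, New)); // becomes the CDSG loop
    return Old;
  }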

To me, it seems that we probably want to coalesce away the COPY when it is just a matter of a use a few instructions later, which is the common case. We also want to coalesce the four COPYs inside the CDSG loop which result from the PHI lowering. Extending the GR128 interval over any long distance is probably not worthwhile, as it ties up a full GR64 register. That is why the search limit seems like a good idea to me, and not only to guard against a worst-case compile-time issue (which does not show up on SPEC).
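Just to make the idea concrete, here is a rough sketch (my reading of the discussion, not the actual patch) of how such a search limit could be applied in the shouldCoalesce() hook. It uses the real TargetRegisterInfo::shouldCoalesce() signature, but the constant, the operand handling and the scanning details are made up for illustration:

  // Sketch only: restrict GR128 coalescing to cases where the widened
  // interval ends within a small window after the COPY.
  #include "SystemZRegisterInfo.h"
  #include "llvm/CodeGen/LiveIntervals.h"
  #include "llvm/CodeGen/MachineBasicBlock.h"
  #include "llvm/CodeGen/MachineInstr.h"
  using namespace llvm;

  bool SystemZRegisterInfo::shouldCoalesce(MachineInstr *MI,
                                           const TargetRegisterClass *SrcRC,
                                           unsigned SubReg,
                                           const TargetRegisterClass *DstRC,
                                           unsigned DstSubReg,
                                           const TargetRegisterClass *NewRC,
                                           LiveIntervals &LIS) const {
    // Only coalescing that creates or extends a GR128 live range is a concern.
    if (NewRC != &SystemZ::GR128BitRegClass)
      return true;

    // Assume the COPY destination is the register that becomes 128 bits
    // (a simplification; the real code has to look at both sides).
    Register WideReg = MI->getOperand(0).getReg();
    if (!WideReg.isVirtual())
      return true;

    // Hypothetical limit, corresponding to the "search limit (at 50)" above.
    const unsigned SearchLimit = 50;

    // Walk forward from the COPY and accept the coalescing only if the
    // GR128 interval ends within SearchLimit instructions, so that a full
    // GR64 register is never tied up over a long distance.
    const LiveInterval &LI = LIS.getInterval(WideReg);
    MachineBasicBlock::iterator I(MI), E = MI->getParent()->end();
    for (unsigned Count = 0; I != E && Count < SearchLimit; ++I) {
      if (I->isDebugInstr())
        continue;
      if (LI.endIndex() <= LIS.getInstructionIndex(*I))
        return true;
      ++Count;
    }
    return false;
  }

With a limit like this, the common case of a use a few instructions after the COPY would still be coalesced, while long-distance extensions of the GR128 interval would be rejected.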


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146425/new/

https://reviews.llvm.org/D146425


