[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Mon Nov 27 05:31:45 PST 2017

kristof.beyls added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:290-294
+  case Interleave:
+    for (auto &I : IRT) {
+      OriginalMCID = &TII->get(I.OrigOpc);
+      for (auto &Repl : I.ReplOpc)
+        ReplInstrMCID.push_back(&TII->get(Repl));
----------------
az wrote:
> kristof.beyls wrote:
> > az wrote:
> > > kristof.beyls wrote:
> > > > This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
> > > > I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
> > > > I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).
> > > Here are some compile time numbers I got by running CTMark. Every number represents the best of three runs.  The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends such as the A57 compile time numbers in column2 are in general lower than the numbers in column1, the difference is usually small that it can be considered noise.
> > > 
> > > Let me know if you have any more comments.
> > > 
> > > |Benchmark         Exynos-m1               cortex-a57          exynos-m1 (pass disabled)     cortex-a57 (pass disabled)|
> > > |7zip                        123.7240                   123.1800                     122.6360                            122.6640|
> > > |Bullet                       82.6360                     82.1200                       82.0840                              82.3680|
> > > |ClamAV                    51.3680                    51.1360                        50.3680                             50.7920|
> > > |consumer-typeset    34.4440                    34.2800                        34.3760                             34.3800|
> > > |kimwitu++               34.3680                    34.2480                        34.6080                              34.2800|
> > > |lencod                      49.8160                    49.7760                        49.2880                              49.2560|
> > > |mafft                        23.0160                    23.0360                         23.1760                              23.0200|
> > > |SPASS                      42.2720                     42.1440                        41.8800                              41.7720|
> > > |sqlite3                      25.0840                     25.1520                        24.9920                             25.1200|
> > > |tramp3d-v4             37.1520                      37.4240                        37.4560                              37.0240|
> > Thanks for the data!
> > I've added some geomean calculations:
> > 
> > | Benchmark	|Exynos-m1	|cortex-a57	|exynos-m1(pass disabled)	|cortex-a57(pass disabled) |
> > | 7zip	|123.724	|123.18	|122.636	|122.664 |
> > | Bullet	|82.636	|82.12	|82.084	|82.368 |
> > | ClamAV	|51.368	|51.136	|50.368	|50.792 |
> > | consumer-typeset	|34.444	|34.28	|34.376	|34.38 |
> > | kimwitu++	|34.368	|34.248	|34.608	|34.28 |
> > | lencod	|49.816	|49.776	|49.288	|49.256 |
> > | mafft	|23.016	|23.036	|23.176	|23.02 |
> > | SPASS	|42.272	|42.144	|41.88	|41.772 |
> > | sqlite3	|25.084	|25.152	|24.992	|25.12 |
> > | tramp3d-v4	|37.152	|37.424	|37.456	|37.024 |
> > | GEOMEAN	|44.14092425	|44.06844708	|43.9700723	|43.90935266 |
> > | COMPARED TO DISABLED	|100.39%	|100.36% |
> > 
> > So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
> > If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
> > I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
> > If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
> > @mcrosier : since you were concerned about compile time of this pass before, what do you think?
> > 
> 0.4% may be a bit high but I am kind of see it within the noise range (for example and on few benchmarks, the difference between the low and the high is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly which determines whether we should go through the pass or not. In the original patch, I checked on 1 rewrite rule only to enter the pass. The current version checks on every rewrite rule. But, some are redundant if we consider how the hardware in general implements these instructions. For example such as 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16h. I believe that we can check for only one instead of all three because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:
> 
> 
> 
> | Benchmark                  cortex-a57 (run1 run2 run3, best)                         cortex-a57 pass disabled (run1 run2 run3, best)|
> |7zip                            122.1760  123.1520  123.2600      122.17                        122.4800  123.2520  123.7160        122.48|
> |Bullet:                           83.2120   82.6360   82.6160         82.61                           82.9200   82.6880   82.6880           82.68|
> |ClamAV:                       51.3040   51.6800   51.5520          51.30                           50.9000   51.1160   51.1320          50.90|
> |consumer-typeset:      34.5240   34.4960   34.5600           34.49                           34.2480   34.5760   34.7560          34.24|
> |kimwitu++:                  34.5200   34.3800   34.4000          34.38                           34.6960   34.3600   34.6480          34.36|
> |lencod:                         50.1840   50.0480   50.1040          50.10                            49.9400   50.0160   49.8940         49.89|
> |mafft:                            23.1560   23.3640   23.4320         23.15                            23.3800   23.2920   23.3880         23.29|
> |9 SPASS:                       41.8960   42.0240   42.0440         41.89                            42.2440   42.0120   41.8920         41.89|
> |sqlite3:                          25.2840   25.1520   25.3520        25.15                             25.2520   25.3760   25.4280        25.25|
> |tramp3d-v4:                  39.1200   37.3160   37.4800        37.31                             37.6840   40.0840   37.5480        37.54|
> |Geomean:                                                                      44.21158                                                                         44.12472|
> |COMPARED TO DISABLED:                                           100.1968%|
> 
> Let me know if you would like to see the patch for that.
It's an option to try and reduce the number of instruction checked in the function shouldExitEarly, but I think that should only be considered as a last resort, as it makes the behaviour of the pass potentially "strange" if on some strange micro-architecture in the future, rewriting rules should be applied differently for e.g. st2.4s, st2.8h and st2.16h. I agree that's unlikely, but not impossible.
I think one more thing to try and do first is to reduce the amount of work done in shouldExitEarly when the results should already by cached.
The code for shouldExitEarly at the moment is:

```
bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {
  const MCInstrDesc* OriginalMCID;
  SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID;

  switch (SP) {
  case VectorElem:
    OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed);
    ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane));
    ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32));
    if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
      return false;
    break;
  case Interleave:
    for (auto &I : IRT) {
      OriginalMCID = &TII->get(I.OrigOpc);
      for (auto &Repl : I.ReplOpc)
        ReplInstrMCID.push_back(&TII->get(Repl));
      if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
        return false;
      ReplInstrMCID.clear();
    }
    break;
  }

  return true;
}
```

In the current implementation of this patch, even when all results are already cached, this still results in the for loop in "case Interleave" to still be traversed completely, IIUC, as the caching of results happens inside shouldReplaceInst. I think that it would be better to instead of, or additionally, cache the results at the "shouldExitEarly" level, i.e. cache if shouldExitEarly should return "true" or "false", so that it wouldn't have to traverse the for loop if on a previous invocation, the answer was already "true". Of course this need to take into account the subtarget targeted by the machine function.
My guess is that this could have enough impact on compile time reduction that no other changes would be needed.
What do you think?

https://reviews.llvm.org/D38196