[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Wed Nov 29 16:19:15 PST 2017

az marked 2 inline comments as done.
az added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:290-294
+  case Interleave:
+    for (auto &I : IRT) {
+      OriginalMCID = &TII->get(I.OrigOpc);
+      for (auto &Repl : I.ReplOpc)
+        ReplInstrMCID.push_back(&TII->get(Repl));
----------------
kristof.beyls wrote:
> az wrote:
> > kristof.beyls wrote:
> > > az wrote:
> > > > kristof.beyls wrote:
> > > > > This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
> > > > > I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
> > > > > I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).
> > > > Here are some compile time numbers I got by running CTMark. Every number represents the best of three runs.  The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends such as the A57 compile time numbers in column2 are in general lower than the numbers in column1, the difference is usually small that it can be considered noise.
> > > > 
> > > > Let me know if you have any more comments.
> > > > 
> > > > |Benchmark         Exynos-m1               cortex-a57          exynos-m1 (pass disabled)     cortex-a57 (pass disabled)|
> > > > |7zip                        123.7240                   123.1800                     122.6360                            122.6640|
> > > > |Bullet                       82.6360                     82.1200                       82.0840                              82.3680|
> > > > |ClamAV                    51.3680                    51.1360                        50.3680                             50.7920|
> > > > |consumer-typeset    34.4440                    34.2800                        34.3760                             34.3800|
> > > > |kimwitu++               34.3680                    34.2480                        34.6080                              34.2800|
> > > > |lencod                      49.8160                    49.7760                        49.2880                              49.2560|
> > > > |mafft                        23.0160                    23.0360                         23.1760                              23.0200|
> > > > |SPASS                      42.2720                     42.1440                        41.8800                              41.7720|
> > > > |sqlite3                      25.0840                     25.1520                        24.9920                             25.1200|
> > > > |tramp3d-v4             37.1520                      37.4240                        37.4560                              37.0240|
> > > Thanks for the data!
> > > I've added some geomean calculations:
> > > 
> > > | Benchmark	|Exynos-m1	|cortex-a57	|exynos-m1(pass disabled)	|cortex-a57(pass disabled) |
> > > | 7zip	|123.724	|123.18	|122.636	|122.664 |
> > > | Bullet	|82.636	|82.12	|82.084	|82.368 |
> > > | ClamAV	|51.368	|51.136	|50.368	|50.792 |
> > > | consumer-typeset	|34.444	|34.28	|34.376	|34.38 |
> > > | kimwitu++	|34.368	|34.248	|34.608	|34.28 |
> > > | lencod	|49.816	|49.776	|49.288	|49.256 |
> > > | mafft	|23.016	|23.036	|23.176	|23.02 |
> > > | SPASS	|42.272	|42.144	|41.88	|41.772 |
> > > | sqlite3	|25.084	|25.152	|24.992	|25.12 |
> > > | tramp3d-v4	|37.152	|37.424	|37.456	|37.024 |
> > > | GEOMEAN	|44.14092425	|44.06844708	|43.9700723	|43.90935266 |
> > > | COMPARED TO DISABLED	|100.39%	|100.36% |
> > > 
> > > So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
> > > If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
> > > I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
> > > If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
> > > @mcrosier : since you were concerned about compile time of this pass before, what do you think?
> > > 
> > 0.4% may be a bit high but I am kind of see it within the noise range (for example and on few benchmarks, the difference between the low and the high is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly which determines whether we should go through the pass or not. In the original patch, I checked on 1 rewrite rule only to enter the pass. The current version checks on every rewrite rule. But, some are redundant if we consider how the hardware in general implements these instructions. For example such as 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16h. I believe that we can check for only one instead of all three because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:
> > 
> > 
> > 
> > | Benchmark                  cortex-a57 (run1 run2 run3, best)                         cortex-a57 pass disabled (run1 run2 run3, best)|
> > |7zip                            122.1760  123.1520  123.2600      122.17                        122.4800  123.2520  123.7160        122.48|
> > |Bullet:                           83.2120   82.6360   82.6160         82.61                           82.9200   82.6880   82.6880           82.68|
> > |ClamAV:                       51.3040   51.6800   51.5520          51.30                           50.9000   51.1160   51.1320          50.90|
> > |consumer-typeset:      34.5240   34.4960   34.5600           34.49                           34.2480   34.5760   34.7560          34.24|
> > |kimwitu++:                  34.5200   34.3800   34.4000          34.38                           34.6960   34.3600   34.6480          34.36|
> > |lencod:                         50.1840   50.0480   50.1040          50.10                            49.9400   50.0160   49.8940         49.89|
> > |mafft:                            23.1560   23.3640   23.4320         23.15                            23.3800   23.2920   23.3880         23.29|
> > |9 SPASS:                       41.8960   42.0240   42.0440         41.89                            42.2440   42.0120   41.8920         41.89|
> > |sqlite3:                          25.2840   25.1520   25.3520        25.15                             25.2520   25.3760   25.4280        25.25|
> > |tramp3d-v4:                  39.1200   37.3160   37.4800        37.31                             37.6840   40.0840   37.5480        37.54|
> > |Geomean:                                                                      44.21158                                                                         44.12472|
> > |COMPARED TO DISABLED:                                           100.1968%|
> > 
> > Let me know if you would like to see the patch for that.
> It's an option to try and reduce the number of instruction checked in the function shouldExitEarly, but I think that should only be considered as a last resort, as it makes the behaviour of the pass potentially "strange" if on some strange micro-architecture in the future, rewriting rules should be applied differently for e.g. st2.4s, st2.8h and st2.16h. I agree that's unlikely, but not impossible.
> I think one more thing to try and do first is to reduce the amount of work done in shouldExitEarly when the results should already by cached.
> The code for shouldExitEarly at the moment is:
> 
> ```
> bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {
>   const MCInstrDesc* OriginalMCID;
>   SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID;
> 
>   switch (SP) {
>   case VectorElem:
>     OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed);
>     ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane));
>     ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32));
>     if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
>       return false;
>     break;
>   case Interleave:
>     for (auto &I : IRT) {
>       OriginalMCID = &TII->get(I.OrigOpc);
>       for (auto &Repl : I.ReplOpc)
>         ReplInstrMCID.push_back(&TII->get(Repl));
>       if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
>         return false;
>       ReplInstrMCID.clear();
>     }
>     break;
>   }
> 
>   return true;
> }
> ```
> 
> In the current implementation of this patch, even when all results are already cached, this still results in the for loop in "case Interleave" to still be traversed completely, IIUC, as the caching of results happens inside shouldReplaceInst. I think that it would be better to instead of, or additionally, cache the results at the "shouldExitEarly" level, i.e. cache if shouldExitEarly should return "true" or "false", so that it wouldn't have to traverse the for loop if on a previous invocation, the answer was already "true". Of course this need to take into account the subtarget targeted by the machine function.
> My guess is that this could have enough impact on compile time reduction that no other changes would be needed.
> What do you think?
It seems that the caching mechanism to save the decision per target of whether to enter the instruction replacement pass really improved compile time. Thank you for the idea. I am including one set of runs below where cortex-a57 with pass is slightly worse in terms of compile time than cortex-a57 without the pass. In order to verify on this, I did few other sets of runs (not shown here) and the difference is usually very small including having a set of runs where with "the pass" outperforms "without the pass" in terms of compile time. It seems that the overhead of ShouldExitEarly is really now within the noise range.

| Benchmark           cortex-a57 (run1 run2 run3, best)                              cortex-a57 pass disabled (run1 run2 run3, best) | 
| 7zip                     122.4480   123.0200   122.7160         122.4480             123.0440    122.7360   123.5120        122.7360|
| Bullet                     82.3960     82.2520     83.1720           82.2520              82.7040      82.2640     83.5840          82.2640|
| ClamAV                  51.6320     51.0560    50.6320           50.6320              51.2920      51.3520     51.0040          51.0040|
| consumer-typese   34.1440     34.5600    34.3360           34.1440              34.7760      33.9560     34.6080          33.9560|
| kimwitu++             34.3360     34.4760    34.5440           34.3360              34.9320      34.6400      34.5960          34.5960|
| lencod                    49.9000     49.3280    50.3160           49.9000              49.6480      49.3800      50.0320         49.3800| 
| mafft                      23.5080     23.2760     23.0200          23.0200               23.3120      23.2120     23.6120          23.2120|
| SPASS                    42.3160     42.0920     42.3040          42.0920               41.9560      42.2640      41.8040         41.8040|
| sqlite3                   25.3840     26.3680     26.2400          25.3840               25.2080      25.2400     25.1760          25.1760|
| tramp3d-v4           37.5680     37.0600     37.2520          37.0600               37.3760      37.1960     37.1760          37.1760|
| Geomean                                                                       44.00964                                                                        43.98918| 
| Compared to disabled                                                100.0465%|

https://reviews.llvm.org/D38196