[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos
Abderrazek Zaafrani via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 29 16:19:15 PST 2017
az marked 2 inline comments as done.
az added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:290-294
+ case Interleave:
+ for (auto &I : IRT) {
+ OriginalMCID = &TII->get(I.OrigOpc);
+ for (auto &Repl : I.ReplOpc)
+ ReplInstrMCID.push_back(&TII->get(Repl));
----------------
kristof.beyls wrote:
> az wrote:
> > kristof.beyls wrote:
> > > az wrote:
> > > > kristof.beyls wrote:
> > > > > This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
> > > > > I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
> > > > > I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).
> > > > Here are some compile time numbers I got by running CTMark. Every number represents the best of three runs. The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends such as the A57 compile time numbers in column2 are in general lower than the numbers in column1, the difference is usually small that it can be considered noise.
> > > >
> > > > Let me know if you have any more comments.
> > > >
> > > > |Benchmark Exynos-m1 cortex-a57 exynos-m1 (pass disabled) cortex-a57 (pass disabled)|
> > > > |7zip 123.7240 123.1800 122.6360 122.6640|
> > > > |Bullet 82.6360 82.1200 82.0840 82.3680|
> > > > |ClamAV 51.3680 51.1360 50.3680 50.7920|
> > > > |consumer-typeset 34.4440 34.2800 34.3760 34.3800|
> > > > |kimwitu++ 34.3680 34.2480 34.6080 34.2800|
> > > > |lencod 49.8160 49.7760 49.2880 49.2560|
> > > > |mafft 23.0160 23.0360 23.1760 23.0200|
> > > > |SPASS 42.2720 42.1440 41.8800 41.7720|
> > > > |sqlite3 25.0840 25.1520 24.9920 25.1200|
> > > > |tramp3d-v4 37.1520 37.4240 37.4560 37.0240|
> > > Thanks for the data!
> > > I've added some geomean calculations:
> > >
> > > | Benchmark |Exynos-m1 |cortex-a57 |exynos-m1(pass disabled) |cortex-a57(pass disabled) |
> > > | 7zip |123.724 |123.18 |122.636 |122.664 |
> > > | Bullet |82.636 |82.12 |82.084 |82.368 |
> > > | ClamAV |51.368 |51.136 |50.368 |50.792 |
> > > | consumer-typeset |34.444 |34.28 |34.376 |34.38 |
> > > | kimwitu++ |34.368 |34.248 |34.608 |34.28 |
> > > | lencod |49.816 |49.776 |49.288 |49.256 |
> > > | mafft |23.016 |23.036 |23.176 |23.02 |
> > > | SPASS |42.272 |42.144 |41.88 |41.772 |
> > > | sqlite3 |25.084 |25.152 |24.992 |25.12 |
> > > | tramp3d-v4 |37.152 |37.424 |37.456 |37.024 |
> > > | GEOMEAN |44.14092425 |44.06844708 |43.9700723 |43.90935266 |
> > > | COMPARED TO DISABLED |100.39% |100.36% |
> > >
> > > So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
> > > If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
> > > I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
> > > If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
> > > @mcrosier : since you were concerned about compile time of this pass before, what do you think?
> > >
> > 0.4% may be a bit high but I am kind of see it within the noise range (for example and on few benchmarks, the difference between the low and the high is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly which determines whether we should go through the pass or not. In the original patch, I checked on 1 rewrite rule only to enter the pass. The current version checks on every rewrite rule. But, some are redundant if we consider how the hardware in general implements these instructions. For example such as 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16h. I believe that we can check for only one instead of all three because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:
> >
> >
> >
> > | Benchmark cortex-a57 (run1 run2 run3, best) cortex-a57 pass disabled (run1 run2 run3, best)|
> > |7zip 122.1760 123.1520 123.2600 122.17 122.4800 123.2520 123.7160 122.48|
> > |Bullet: 83.2120 82.6360 82.6160 82.61 82.9200 82.6880 82.6880 82.68|
> > |ClamAV: 51.3040 51.6800 51.5520 51.30 50.9000 51.1160 51.1320 50.90|
> > |consumer-typeset: 34.5240 34.4960 34.5600 34.49 34.2480 34.5760 34.7560 34.24|
> > |kimwitu++: 34.5200 34.3800 34.4000 34.38 34.6960 34.3600 34.6480 34.36|
> > |lencod: 50.1840 50.0480 50.1040 50.10 49.9400 50.0160 49.8940 49.89|
> > |mafft: 23.1560 23.3640 23.4320 23.15 23.3800 23.2920 23.3880 23.29|
> > |9 SPASS: 41.8960 42.0240 42.0440 41.89 42.2440 42.0120 41.8920 41.89|
> > |sqlite3: 25.2840 25.1520 25.3520 25.15 25.2520 25.3760 25.4280 25.25|
> > |tramp3d-v4: 39.1200 37.3160 37.4800 37.31 37.6840 40.0840 37.5480 37.54|
> > |Geomean: 44.21158 44.12472|
> > |COMPARED TO DISABLED: 100.1968%|
> >
> > Let me know if you would like to see the patch for that.
> It's an option to try and reduce the number of instruction checked in the function shouldExitEarly, but I think that should only be considered as a last resort, as it makes the behaviour of the pass potentially "strange" if on some strange micro-architecture in the future, rewriting rules should be applied differently for e.g. st2.4s, st2.8h and st2.16h. I agree that's unlikely, but not impossible.
> I think one more thing to try and do first is to reduce the amount of work done in shouldExitEarly when the results should already by cached.
> The code for shouldExitEarly at the moment is:
>
> ```
> bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {
> const MCInstrDesc* OriginalMCID;
> SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID;
>
> switch (SP) {
> case VectorElem:
> OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed);
> ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane));
> ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32));
> if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
> return false;
> break;
> case Interleave:
> for (auto &I : IRT) {
> OriginalMCID = &TII->get(I.OrigOpc);
> for (auto &Repl : I.ReplOpc)
> ReplInstrMCID.push_back(&TII->get(Repl));
> if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
> return false;
> ReplInstrMCID.clear();
> }
> break;
> }
>
> return true;
> }
> ```
>
> In the current implementation of this patch, even when all results are already cached, this still results in the for loop in "case Interleave" to still be traversed completely, IIUC, as the caching of results happens inside shouldReplaceInst. I think that it would be better to instead of, or additionally, cache the results at the "shouldExitEarly" level, i.e. cache if shouldExitEarly should return "true" or "false", so that it wouldn't have to traverse the for loop if on a previous invocation, the answer was already "true". Of course this need to take into account the subtarget targeted by the machine function.
> My guess is that this could have enough impact on compile time reduction that no other changes would be needed.
> What do you think?
It seems that the caching mechanism to save the decision per target of whether to enter the instruction replacement pass really improved compile time. Thank you for the idea. I am including one set of runs below where cortex-a57 with pass is slightly worse in terms of compile time than cortex-a57 without the pass. In order to verify on this, I did few other sets of runs (not shown here) and the difference is usually very small including having a set of runs where with "the pass" outperforms "without the pass" in terms of compile time. It seems that the overhead of ShouldExitEarly is really now within the noise range.
| Benchmark cortex-a57 (run1 run2 run3, best) cortex-a57 pass disabled (run1 run2 run3, best) |
| 7zip 122.4480 123.0200 122.7160 122.4480 123.0440 122.7360 123.5120 122.7360|
| Bullet 82.3960 82.2520 83.1720 82.2520 82.7040 82.2640 83.5840 82.2640|
| ClamAV 51.6320 51.0560 50.6320 50.6320 51.2920 51.3520 51.0040 51.0040|
| consumer-typese 34.1440 34.5600 34.3360 34.1440 34.7760 33.9560 34.6080 33.9560|
| kimwitu++ 34.3360 34.4760 34.5440 34.3360 34.9320 34.6400 34.5960 34.5960|
| lencod 49.9000 49.3280 50.3160 49.9000 49.6480 49.3800 50.0320 49.3800|
| mafft 23.5080 23.2760 23.0200 23.0200 23.3120 23.2120 23.6120 23.2120|
| SPASS 42.3160 42.0920 42.3040 42.0920 41.9560 42.2640 41.8040 41.8040|
| sqlite3 25.3840 26.3680 26.2400 25.3840 25.2080 25.2400 25.1760 25.1760|
| tramp3d-v4 37.5680 37.0600 37.2520 37.0600 37.3760 37.1960 37.1760 37.1760|
| Geomean 44.00964 43.98918|
| Compared to disabled 100.0465%|
https://reviews.llvm.org/D38196
More information about the llvm-commits
mailing list