[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos
Abderrazek Zaafrani via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Nov 21 09:35:53 PST 2017
az marked 2 inline comments as done.
az added a comment.
i
================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:290-294
+ case Interleave:
+ for (auto &I : IRT) {
+ OriginalMCID = &TII->get(I.OrigOpc);
+ for (auto &Repl : I.ReplOpc)
+ ReplInstrMCID.push_back(&TII->get(Repl));
----------------
kristof.beyls wrote:
> az wrote:
> > kristof.beyls wrote:
> > > This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
> > > I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
> > > I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).
> > Here are some compile time numbers I got by running CTMark. Every number represents the best of three runs. The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends such as the A57 compile time numbers in column2 are in general lower than the numbers in column1, the difference is usually small that it can be considered noise.
> >
> > Let me know if you have any more comments.
> >
> > |Benchmark Exynos-m1 cortex-a57 exynos-m1 (pass disabled) cortex-a57 (pass disabled)|
> > |7zip 123.7240 123.1800 122.6360 122.6640|
> > |Bullet 82.6360 82.1200 82.0840 82.3680|
> > |ClamAV 51.3680 51.1360 50.3680 50.7920|
> > |consumer-typeset 34.4440 34.2800 34.3760 34.3800|
> > |kimwitu++ 34.3680 34.2480 34.6080 34.2800|
> > |lencod 49.8160 49.7760 49.2880 49.2560|
> > |mafft 23.0160 23.0360 23.1760 23.0200|
> > |SPASS 42.2720 42.1440 41.8800 41.7720|
> > |sqlite3 25.0840 25.1520 24.9920 25.1200|
> > |tramp3d-v4 37.1520 37.4240 37.4560 37.0240|
> Thanks for the data!
> I've added some geomean calculations:
>
> | Benchmark |Exynos-m1 |cortex-a57 |exynos-m1(pass disabled) |cortex-a57(pass disabled) |
> | 7zip |123.724 |123.18 |122.636 |122.664 |
> | Bullet |82.636 |82.12 |82.084 |82.368 |
> | ClamAV |51.368 |51.136 |50.368 |50.792 |
> | consumer-typeset |34.444 |34.28 |34.376 |34.38 |
> | kimwitu++ |34.368 |34.248 |34.608 |34.28 |
> | lencod |49.816 |49.776 |49.288 |49.256 |
> | mafft |23.016 |23.036 |23.176 |23.02 |
> | SPASS |42.272 |42.144 |41.88 |41.772 |
> | sqlite3 |25.084 |25.152 |24.992 |25.12 |
> | tramp3d-v4 |37.152 |37.424 |37.456 |37.024 |
> | GEOMEAN |44.14092425 |44.06844708 |43.9700723 |43.90935266 |
> | COMPARED TO DISABLED |100.39% |100.36% |
>
> So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
> If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
> I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
> If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
> @mcrosier : since you were concerned about compile time of this pass before, what do you think?
>
0.4% may be a bit high but I am kind of see it within the noise range (for example and on few benchmarks, the difference between the low and the high is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly which determines whether we should go through the pass or not. In the original patch, I checked on 1 rewrite rule only to enter the pass. The current version checks on every rewrite rule. But, some are redundant if we consider how the hardware in general implements these instructions. For example such as 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16h. I believe that we can check for only one instead of all three because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:
| Benchmark cortex-a57 (run1 run2 run3, best) cortex-a57 pass disabled (run1 run2 run3, best)|
|7zip 122.1760 123.1520 123.2600 122.17 122.4800 123.2520 123.7160 122.48|
|Bullet: 83.2120 82.6360 82.6160 82.61 82.9200 82.6880 82.6880 82.68|
|ClamAV: 51.3040 51.6800 51.5520 51.30 50.9000 51.1160 51.1320 50.90|
|consumer-typeset: 34.5240 34.4960 34.5600 34.49 34.2480 34.5760 34.7560 34.24|
|kimwitu++: 34.5200 34.3800 34.4000 34.38 34.6960 34.3600 34.6480 34.36|
|lencod: 50.1840 50.0480 50.1040 50.10 49.9400 50.0160 49.8940 49.89|
|mafft: 23.1560 23.3640 23.4320 23.15 23.3800 23.2920 23.3880 23.29|
|9 SPASS: 41.8960 42.0240 42.0440 41.89 42.2440 42.0120 41.8920 41.89|
|sqlite3: 25.2840 25.1520 25.3520 25.15 25.2520 25.3760 25.4280 25.25|
|tramp3d-v4: 39.1200 37.3160 37.4800 37.31 37.6840 40.0840 37.5480 37.54|
|Geomean: 44.21158 44.12472|
|COMPARED TO DISABLED: 100.1968%|
Let me know if you would like to see the patch for that.
https://reviews.llvm.org/D38196
More information about the llvm-commits
mailing list