[PATCH] D32451: Improve profile-guided heuristics to use estimated trip count.

Thu Apr 27 14:36:22 PDT 2017

twoh added a comment.

I evaluated the loop vectorizer with https://github.com/malvanos/Video-SIMDBench, which is introduced in this paper: http://ieeexplore.ieee.org/document/7723550/. I first built it without profile data and ran Linux perf with the following command:

  perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=400009/ -b ./bench

The Linux perf data was then processed with the autofdo tool (https://github.com/google/autofdo) and provided to the subsequent compilations for the evaluation.

There are 220 benchmarks in the set, and 18 of them are affected by this patch. These 18 benchmarks reduce to 6 functions (for example, all benchmarks named `mc_chroma_?x?` call `mc_chroma` with different parameters). The 6 functions are `quant_4x4`, `quant_4x4x4`, `vbench_plane_copy_deinterleave_rgb_c`, `vbench_plane_copy_interleave_c`, `mc_chroma`, and `mc_luma`. 3 of the 6 functions are only vectorized with the existing heuristic, while the other 3 are only vectorized with the new heuristic. Let's go through them in detail.

---

- `quant_4x4x4` and `quant_4x4`

These functions were only vectorized with the original heuristic, because the profile data indicates that the target loop's estimated execution count is 1. In fact, the target loops in these functions have a fixed execution count, which can be computed with the ScalarEvolution::getSmallConstantTripCount function. So once I fix the heuristic to use the profile-guided estimated trip count only when ScalarEvolution::getSmallConstantTripCount fails to compute the actual trip count, both the original and the new heuristic generate the same vectorized code.
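The fallback order described above can be sketched as a toy model in Python. This is not the patch itself: the function name `effective_trip_count` and the sample values are hypothetical, and treating 0 as "unknown" is an assumption about how the static query reports failure.

```python
from typing import Optional


def effective_trip_count(static_trip_count: int,
                         profile_estimate: Optional[int]) -> Optional[int]:
    """Pick the trip count a vectorization heuristic should trust.

    static_trip_count: compile-time constant trip count; 0 is assumed to
        mean "could not be computed" (hypothetical convention).
    profile_estimate: trip count estimated from profile data, or None if
        no profile is available.
    """
    # Prefer the statically known trip count whenever it exists.
    if static_trip_count > 0:
        return static_trip_count
    # Otherwise fall back to the profile-guided estimate (may be None).
    return profile_estimate


# Hypothetical loop with a fixed trip count of 16 whose profile says 1:
# the static value wins, so a misleading profile cannot block vectorization.
print(effective_trip_count(16, 1))
# Hypothetical loop with no static trip count: the profile estimate is used.
print(effective_trip_count(0, 153))
```

The point of the ordering is that a misleading profile estimate can only influence loops whose trip count is not known at compile time.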

- `vbench_plane_copy_deinterleave_rgb_c` and `vbench_plane_copy_interleave_c`

These functions are only vectorized with the new heuristic. For both of them, the estimated loop entry count is less than 20% of the estimated function entry count. This is odd, because the source code shows that the loop should be entered whenever the enclosing function is invoked.

Comparing performance, `vbench_plane_copy_deinterleave_rgb_c` is **6.2 percent slower** with the new heuristic, which means vectorization harms the performance, while `vbench_plane_copy_interleave_c` is **5 times (not percent!) faster** with the new heuristic, which means vectorization is super beneficial.

| Function                             | Benchmark                   | Cycles (Original) | Cycles (New) | Difference(%) |
| vbench_plane_copy_deinterleave_rgb_c | plane_copy_deinterleave_rgb | 11358             | 12064        | +6.2%         |
| vbench_plane_copy_interleave_c       | plane_copy_interleave       | 14616             | 2815         | -80.7%        |
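As a sanity check on the table, the percentage column and the "5 times faster" figure can be recomputed from the cycle counts. The helper names `pct_change` and `speedup` are only illustrative.

```python
def pct_change(original_cycles: int, new_cycles: int) -> float:
    """Relative change in cycles; positive means the new heuristic is slower."""
    return (new_cycles - original_cycles) / original_cycles * 100.0


def speedup(original_cycles: int, new_cycles: int) -> float:
    """How many times faster the new build runs (values > 1 are faster)."""
    return original_cycles / new_cycles


# Cycle counts copied from the table above: (original, new).
rows = {
    "plane_copy_deinterleave_rgb": (11358, 12064),
    "plane_copy_interleave": (14616, 2815),
}

for name, (orig, new) in rows.items():
    print(f"{name}: {pct_change(orig, new):+.1f}%, {speedup(orig, new):.2f}x")
```

A change of -80.7% in cycles corresponds to a speedup of roughly 5.2x, which is why the two ways of stating the result look so different.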

This seems counter-intuitive, considering that the operations performed in these functions are practically identical (each is just a copy of array elements). However, the actual trip count of the target loop can explain the difference. If you look at the trip count histogram of `vbench_plane_copy_deinterleave_rgb_c` across multiple invocations, it is

  {trip count: occurrence} = {1:13056, 2:13056, 5:13056, 8:13056, 64:26112, 66:13056, 126:13056, 132:13056, 476:13056}

while for `vbench_plane_copy_interleave_c`, it is

  {trip count: occurrence} = {1:32256, 4:32256, 10:32256, 16:6528, 128:13056, 132:32256, 252:32256, 264:32256, 952:32256}

So for `vbench_plane_copy_deinterleave_rgb_c`, the low-trip-count invocations offset the benefit that vectorization brings to the high-trip-count invocations, but for `vbench_plane_copy_interleave_c`, the high-trip-count invocations dominate the execution time, so the benefit of vectorization is clearly visible.
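To put numbers on the two histograms, here is a small sketch that computes the mean trip count per invocation and the share of invocations that fall below a cutoff. The cutoff of 8 iterations and the function name `summarize` are assumptions chosen for illustration, not values used by the vectorizer.

```python
def summarize(histogram: dict, low_threshold: int = 8) -> dict:
    """Summarize a {trip count: occurrence} histogram.

    low_threshold is an arbitrary illustrative cutoff for "low trip count".
    """
    invocations = sum(histogram.values())
    iterations = sum(trip * count for trip, count in histogram.items())
    low = sum(count for trip, count in histogram.items() if trip < low_threshold)
    return {
        "invocations": invocations,
        "mean_trip_count": iterations / invocations,
        "low_trip_share": low / invocations,
    }


# Histograms copied from the measurements above.
deinterleave = {1: 13056, 2: 13056, 5: 13056, 8: 13056, 64: 26112,
                66: 13056, 126: 13056, 132: 13056, 476: 13056}
interleave = {1: 32256, 4: 32256, 10: 32256, 16: 6528, 128: 13056,
              132: 32256, 252: 32256, 264: 32256, 952: 32256}

for name, hist in (("deinterleave_rgb", deinterleave), ("interleave", interleave)):
    stats = summarize(hist)
    print(f"{name}: mean trip count {stats['mean_trip_count']:.1f}, "
          f"{stats['low_trip_share']:.0%} of invocations below the cutoff")
```

With these histograms the mean trip count is 94.4 for `vbench_plane_copy_deinterleave_rgb_c` and about 219.5 for `vbench_plane_copy_interleave_c`, which is consistent with the direction of the measured difference. It does not by itself account for the per-invocation overhead of the vectorized code.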

- `mc_chroma` and `mc_luma`

The evaluation of these two functions clearly shows the correlation between the trip count and the effect of vectorization. There are 7 benchmarks associated with each function, each with different parameters. `mc_chroma` is only vectorized with the new heuristic, while `mc_luma` is only vectorized with the original heuristic. Below is the performance summary:

| Function  | Benchmark     | Cycles (Original) | Cycles (New) | Difference(%) |
| mc_chroma | mc_chroma_2x2 | 530               | 586          | +10.6         |
|           | mc_chroma_2x4 | 840               | 921          | +9.6          |
|           | mc_chroma_4x2 | 816               | 861          | +5.5          |
|           | mc_chroma_4x4 | 1415              | 1465         | +3.5          |
|           | mc_chroma_4x8 | 2610              | 2669         | +2.3          |
|           | mc_chroma_8x4 | 2617              | 2676         | +2.3          |
|           | mc_chroma_8x8 | 5016              | 5103         | +1.7          |
| mc_luma   | mc_luma_4X4   | 552               | 489          | -11.4         |
|           | mc_luma_4X8   | 892               | 818          | -8.3          |
|           | mc_luma_8X4   | 753               | 730          | -3.1          |
|           | mc_luma_8X8   | 1301              | 1300         | -0.1          |
|           | mc_luma_8X16  | 2488              | 2437         | -2.0          |
|           | mc_luma_16X8  | 2264              | 2487         | +9.8          |
|           | mc_luma_16X16 | 4219              | 4701         | +11.4         |

As the table shows, for low-trip-count invocations (roughly below 8x8), the non-vectorized code (original for `mc_chroma` and new for `mc_luma`) performs better, but for high-trip-count invocations, the vectorized code (original `mc_luma`) performs better. Here, again, I suspect that the profile numbers might be wrong: the profile estimates the trip count as 153 for `mc_chroma` but as 3 for `mc_luma`, which results in different vectorization decisions with the new heuristic. (LoopEntryCount/ColdEntryCount were 8/20 for `mc_chroma` and 270404/23906 for `mc_luma`, which affects the original heuristic's vectorization decision.) I guess the profile numbers might be getting corrupted during the loop transformations, but I don't have evidence for that yet.

---

Given these evaluation results, I think trip count is a better metric than invocation count for estimating the effectiveness of vectorization. I also think we need to be more conservative about vectorizing low-trip-count loops, and we need to improve the precision of profile data across the optimization passes.

https://reviews.llvm.org/D32451