[PATCH] D32451: Improve profile-guided heuristics to use estimated trip count.

Mon May 22 15:33:31 PDT 2017

twoh added a comment.

Comparing the existing implementation and this patch, I don't observe a noticeable compile time difference with Video-SIMDBench. I compiled the benchmark suite 3 times each, and the median was 27.12 sec vs 27.55 sec, while the average was 27.96 sec vs 27.89 sec. There were some code size differences, but that is simply because more vectorization results in bigger code. There's no difference between OptForSize and non-vectorized.

I observe that branch frequency metadata handling has been improved since I first submitted this diff, which makes a difference for some benchmarks. For example, mc_chroma, which was not vectorized with the existing implementation but was vectorized with this patch, is now vectorized with the original implementation but not with this patch. However, branch frequency metadata is still not perfect: I was able to find 4 loops in mc_luma whose loop entry frequency information is available but whose estimated trip count is not. These loops are vectorized if we make the vectorization decision based only on trip count, but not if we fall back to loop entry frequency when the trip count is unavailable, because their entry frequency is smaller than the threshold. Also, there are loops whose trip counts are underestimated and that therefore miss the vectorization opportunity. By chance, these loops perform better with the existing implementation because their loop entry frequency is higher than the cold entry frequency.
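
To make the three decision policies compared above concrete, here is a minimal standalone sketch. It is not the code in this patch or in LoopVectorize: the struct, the function names, and both threshold values are hypothetical and chosen only for illustration, and the "existing implementation" is simplified to a pure entry-frequency check.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical per-loop profile summary; not an LLVM data structure.
struct LoopProfile {
  // Empty when the profile metadata cannot provide an estimate.
  std::optional<uint64_t> EstimatedTripCount;
  // How often the loop is entered, in arbitrary profile units.
  uint64_t EntryFrequency;
};

// Illustrative thresholds (assumptions, not LLVM's actual values).
constexpr uint64_t MinTripCountForVectorization = 16;
constexpr uint64_t ColdEntryFrequency = 100;

// Simplified stand-in for the existing heuristic: a loop is treated as
// cold, and left unvectorized, when its entry frequency is below the
// cold threshold.
bool shouldVectorizeEntryFrequencyOnly(const LoopProfile &P) {
  return P.EntryFrequency >= ColdEntryFrequency;
}

// Decide on the estimated trip count alone; when no estimate is
// available, do not block vectorization.
bool shouldVectorizeTripCountOnly(const LoopProfile &P) {
  if (P.EstimatedTripCount)
    return *P.EstimatedTripCount >= MinTripCountForVectorization;
  return true;
}

// Prefer the estimated trip count, but fall back to the entry
// frequency check when no estimate is available.
bool shouldVectorizeWithFrequencyFallback(const LoopProfile &P) {
  if (P.EstimatedTripCount)
    return *P.EstimatedTripCount >= MinTripCountForVectorization;
  return shouldVectorizeEntryFrequencyOnly(P);
}
```

Under these assumptions, a loop with no trip count estimate and a low entry frequency (the mc_luma situation) is vectorized by the trip-count-only policy but rejected by the fallback policy. A loop with an underestimated trip count and a high entry frequency is rejected by both trip-count policies but accepted by the entry-frequency-only check.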

In summary, I don't see much difference between OptForSize and non-vectorized, but I do see the potential for better vectorization decisions with this patch as profile info becomes more precise.

https://reviews.llvm.org/D32451