[PATCH] D12765: [LV] Allow vectorization of loops with induction post-inc expressions

Jakub Kuderski via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 16 06:18:55 PDT 2015


kuhar added a comment.

Hi Michael,

I've spent the last two days trying to come up with an IndVarSimplify patch that would enable the unmodified LoopVectorizer to work on my code.
The problem is that currently IndVarSimplify::RewriteExitValues only rewrites expressions that are loop-invariant, and there is a cost check inside.
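For reference, the guard I'm referring to looks roughly like this (a simplified sketch of the logic, not verbatim LLVM code; the helper name and parameters are mine):

```
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpander.h"
using namespace llvm;

// Simplified sketch of the exit-value rewriting guard in IndVarSimplify.
static bool shouldRewriteExitValue(Instruction *Inst, Loop *L,
                                   ScalarEvolution &SE, SCEVExpander &Rewriter) {
  // Only expressions whose value at loop exit is loop-invariant qualify.
  const SCEV *ExitValue = SE.getSCEVAtScope(Inst, L->getParentLoop());
  if (!SE.isLoopInvariant(ExitValue, L) || !isSafeToExpand(ExitValue, SE))
    return false;

  // This is the HighCost check mentioned above: expensive SCEV expansions
  // are normally rejected.
  if (Rewriter.isHighCostExpansion(ExitValue, L))
    return false;

  return true;
}
```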
When I tried commenting out this HighCost check, the benchmark scores were not great: the only benchmark that improved significantly was automotive-susan (~21%), and there were many regressions, e.g. (all scores measured on A57/aarch64):
`lnt.MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk`	27.03%
`lnt.MultiSource/Benchmarks/sim/sim`	8.96%
`lnt.MultiSource/Benchmarks/Olden/bh/bh`	6.13%
`lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode`	6.07%
`lnt.SingleSource/Benchmarks/Stanford/Puzzle`	5.76%
`lnt.MultiSource/Benchmarks/Prolangs-C++/ocean/ocean`	5.34%
`lnt.MultiSource/Applications/lemon/lemon`	4.84%
`lnt.MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl`	4.73%
`lnt.MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2`	4.15%
There were also some severe regressions on our internal benchmarks.

I tried to come up with a heuristic approach, like recursively counting the number of expensive operations coming from the SCEV expansion and comparing it with the number of instructions in the whole loop (a rough sketch of what I tried is below the list of remaining regressions). After playing with it for some time I managed to get rid of the most serious regressions in LNT - the only ones left were:
`lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode`	6.14%
`lnt.MultiSource/Benchmarks/Olden/bh/bh`	5.39%
`lnt.SingleSource/Benchmarks/BenchmarkGame/recursive`	3.44%
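To give a better idea of the heuristic's shape, here is the rough sketch mentioned above (not the actual patch; countExpensiveOps, isCheapEnough and the Factor knob are names I made up for illustration):

```
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
using namespace llvm;

// Recursively count SCEV sub-expressions that would expand to expensive
// instructions (multiplies, divides).
static unsigned countExpensiveOps(const SCEV *S) {
  unsigned Cost = 0;
  if (isa<SCEVMulExpr>(S) || isa<SCEVUDivExpr>(S))
    ++Cost;
  if (const auto *NAry = dyn_cast<SCEVNAryExpr>(S)) {
    for (auto I = NAry->op_begin(), E = NAry->op_end(); I != E; ++I)
      Cost += countExpensiveOps(*I);
  } else if (const auto *Div = dyn_cast<SCEVUDivExpr>(S)) {
    Cost += countExpensiveOps(Div->getLHS()) + countExpensiveOps(Div->getRHS());
  } else if (const auto *Cast = dyn_cast<SCEVCastExpr>(S)) {
    Cost += countExpensiveOps(Cast->getOperand());
  }
  return Cost;
}

// Allow the exit-value rewrite only when the expansion is small relative to
// the number of instructions in the loop body.
static bool isCheapEnough(const SCEV *ExitValue, const Loop *L,
                          unsigned Factor = 4) {
  unsigned LoopSize = 0;
  for (const BasicBlock *BB : L->blocks())
    LoopSize += BB->size();
  return countExpensiveOps(ExitValue) * Factor < LoopSize;
}
```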

The problem was that there were not as many (and smaller) improvements, and there were still some serious regressions in our benchmarks. Another issue is that even in the most aggressive configuration (i.e. always assume the expansion is cheap) I was not able to reproduce the improvements coming from my original LoopVectorizer patch. I suspect the difference comes from the fact that IndVarSimplify actually creates new instructions, while my changes in the LoopVectorizer only reuse existing values.

I think the problem is simply that IndVarSimplify runs too early to determine whether it's beneficial to rewrite some exit values, and that doing so emits new instructions - it's very easy to regress things. Maybe some other optimization pass (like the recently discussed LEV) could do cleanup after loop vectorization, I'm not sure... Anyway, I think the biggest argument for patching the LoopVectorizer is that it regresses code very little (from what I've seen running multiple benchmarks).
Do you have any other ideas on how to make these changes in IndVarSimplify, Michael?

Cheers,
Jakub


Repository:
  rL LLVM

http://reviews.llvm.org/D12765
