[PATCH] D31965: [SLP] Enable 64-bit wide vectorization for Cyclone

Kristof Beyls via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 12 08:08:01 PDT 2017


kristof.beyls added a comment.

In https://reviews.llvm.org/D31965#724804, @anemet wrote:

> Hi Renato,
>
> In https://reviews.llvm.org/D31965#724619, @rengolin wrote:
>
> > Hi Adam,
> >
> > Interesting results! But it doesn't sound like this is Cyclone specific.
>
>
> Sure it's not, it is just a deployment strategy for this change.  See the FIXME in the code.
>
> Rolling it out for Cyclone only is just a way to get this going in a controllable manner.  Other subtargets can roll this out as people find the time to benchmark and tune it.
>
> As the results section shows, I did and am still doing some tuning on this.  This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not that great; thus the accuracy of the cost model is really tested by enabling this.


I also got the impression that this change is somewhat (but only somewhat) independent of the micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers against the gain from doing the arithmetic in a SIMD fashion.
Of course, both the cost of getting data into vector registers and the gain from doing arithmetic in a SIMD fashion are somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

I've started to notice a trend where, at least for AArch64, specific transformations are enabled/disabled only for specific cores, even when the transformation seems beneficial for most cores and so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on the best way to strike the right balance between enabling only for specific cores and enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

>> @kristof.beyls Can you check on A57?
> 
> That would be great.  Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including the test-suite and SPEC2000, but not SPEC2006, running every program 3 times.
Apart from the mesa, bzip2 and bullet results Adam mentions, the results I see are on a few different programs:

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer	8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from SIMD processing in the hot loops in function "begin".
External/SPEC/CINT2000/256.bzip2/256.bzip2	2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex	2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity to code layout.

Performance Improvements - Execution Time
MultiSource/Benchmarks/Bullet/bullet	-3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa	-1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc	-1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

> Adam




https://reviews.llvm.org/D31965




