[PATCH] D31965: [SLP] Enable 64-bit wide vectorization for Cyclone

Kristof Beyls via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 12 08:44:03 PDT 2017


kristof.beyls added a comment.

In https://reviews.llvm.org/D31965#724869, @anemet wrote:

> Hi Kristof,
>
> In https://reviews.llvm.org/D31965#724860, @kristof.beyls wrote:
>
> > In https://reviews.llvm.org/D31965#724804, @anemet wrote:
> >
> > > Hi Renato,
> > >
> > > In https://reviews.llvm.org/D31965#724619, @rengolin wrote:
> > >
> > > > Hi Adam,
> > > >
> > > > Interesting results! But it doesn't sound like this is Cyclone specific.
> > >
> > >
> > > Sure it's not, it is just a deployment strategy for this change.  See the FIXME in the code.
> > >
> > > Rolling it out for Cyclone-only is just a way to get this going in a controllable manner.  Other subtargets can roll this out as people find the time to benchmark and tune it.
> > >
> > > As the results section shows, I have done and am still doing some tuning on this.  This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not so great; thus the accuracy of the cost model is really tested by enabling this.
> >
> >
> > I also got the impression that this change is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers against the gain from doing arithmetic in a SIMD fashion.
> >  Of course, the cost of getting data into vector registers and the gain of doing arithmetic in a SIMD fashion are somewhat micro-architecture dependent.
> >
> > I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?
>
>
> Yes, do you have https://reviews.llvm.org/rL299482?


Yes, I ran this on top of r299981.

>> I've started to notice a trend where, at least for AArch64, specific transformations are enabled/disabled only for specific cores, even when the transformation seems beneficial for most cores and so should probably also be enabled for "-mcpu=generic".
>>  I don't think there is a straightforward answer on what the best way is to achieve making the right balanced tradeoff between enabling only for specific cores vs enabling for all cores.
>>  I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?
> 
> I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets.  As I said, I had good results with SW data prefetching, but, for example, the ARM microarchitectures didn't add support for it.

I am open to trying out any idea that improves over the current situation, and your idea seems to do so! If other subtarget "owners" also like this idea, let's try it out!

>>>> @kristof.beyls Can you check on A57?
>>> 
>>> That would be great.  Thanks!
>> 
>> So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including test-suite and SPEC2000 but not SPEC2006, running every program 3 times.
>>  Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:
> 
> Thanks!
> 
>> Performance Regressions - Execution Time
>>  MultiSource/Benchmarks/VersaBench/beamformer/beamformer	8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from SIMD processing in the hot loops in function "begin".
> 
> This is very interesting if you have https://reviews.llvm.org/rL299482.  This is a reversal from Cyclone where this results in a nice gain.  The cost model is at the threshold and the perceived benefit is minimal (-1, 0 being the threshold).  FTR, beyond https://reviews.llvm.org/rL299482, I have no further plans to work on this.

That's a useful insight, thanks for sharing!

>> External/SPEC/CINT2000/256.bzip2/256.bzip2	2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
>>  External/SPEC/CINT2000/255.vortex/255.vortex	2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.
>> 
>> Performance Improvements - Execution Time
>>  MultiSource/Benchmarks/Bullet/bullet	-3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
>>  External/SPEC/CFP2000/177.mesa/177.mesa	-1.69%: vectorization now happens in some of the hottest basic blocks.
>>  External/SPEC/CINT2000/176.gcc/176.gcc	-1.42%: I didn't have time to analyze this one further.
>> 
>> In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.
> 
> From the above results, I expect bzip2 to improve but unfortunately nothing else.  I'd still like to enable this; are you still OK with that?

Yes, I'd still be fine with enabling this.

Thanks for all your efforts on this!

Kristof


https://reviews.llvm.org/D31965




