[PATCH] D36454: [X86] Changes to extract Horizontal addition operation for AVX-512.

Wed Oct 25 19:29:09 PDT 2017

hfinkel added a comment.

In https://reviews.llvm.org/D36454#884427, @jbhateja wrote:

> In https://reviews.llvm.org/D36454#884252, @RKSimon wrote:
>
> > I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.
> >
> > PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.
>
>
> Hi Simon,
>
> Thanks for pointing me to the code references. I tried a simple case, which was not optimized by InstCombiner::SimplifyDemandedVectorElts.
> It works over the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

In my opinion, @llvm.experimental.vector.reduce.fadd and friends should be treated as the canonical forms of those operations. InstCombine should form these intrinsics upon encountering these shuffle patterns.

The SLPVectorizer, LoopVectorizer, etc. should use the intrinsics when handling relevant reductions.

Is there an advantage to handling the shuffle patterns in the backend directly as opposed to forming the intrinsics earlier and then handling them in the backend?
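
To make the question concrete, here is a rough sketch of the two forms for a 4-lane fadd reduction. The function names are made up for illustration, and the intrinsic spelling is the one I recall from the LangRef at the time of this review (later releases renamed these intrinsics), so treat the exact mangling as an assumption:

```llvm
; Shuffle form: log2(N) shuffle+fadd steps, result read from lane 0.
define float @reduce_shuffles(<4 x float> %x) {
  %s1 = shufflevector <4 x float> %x, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %a1 = fadd fast <4 x float> %x, %s1    ; lane 0 = x0+x2, lane 1 = x1+x3
  %s2 = shufflevector <4 x float> %a1, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %a2 = fadd fast <4 x float> %a1, %s2   ; lane 0 = (x0+x2)+(x1+x3)
  %r = extractelement <4 x float> %a2, i32 0
  ret float %r
}

; Intrinsic form (assumed spelling; with fast-math flags the scalar
; accumulator argument is unused and may be undef).
declare float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float, <4 x float>)

define float @reduce_intrinsic(<4 x float> %x) {
  %r = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %x)
  ret float %r
}
```

The fast-math flags matter here: without them the intrinsic is an ordered reduction, which the pairwise shuffle form does not compute.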

> define float @fhsub_16(<16 x float> %x225) {
>  ;define <16 x float> @fhsub_16(<16 x float> %x225) {
> 
>   %x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>   %x227 = fadd <16 x float> %x225, %x226
>   %x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
>   %x229 = fsub <16 x float> %x227, %x228
>   %x230 = extractelement <16 x float> %x229, i32 0
>   ret float %x230
> 
> }
> 
> This patch provides two generic routines that try to scale operations down to X86 vector register sizes; patch https://reviews.llvm.org/D36650 proposes a similar effort.
> 
> PR33758 is specifically about inference of horizontal operations for AVX512 vector types.
> 
> Kindly elaborate on how add reduction is helpful here for the cases provided in the test file avx512-hadd-hsub.ll.
> 
> Thanks
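
For reference, working the shuffle masks of the quoted @fhsub_16 through by hand, lane 0 of the result appears to reduce to the scalar form below. This is only a sketch for discussion: the function name and the %e0..%e3, %lo, %hi values are mine, not from the patch or its tests.

```llvm
define float @fhsub_16_scalar(<16 x float> %x) {
  %e0 = extractelement <16 x float> %x, i32 0
  %e1 = extractelement <16 x float> %x, i32 1
  %e2 = extractelement <16 x float> %x, i32 2
  %e3 = extractelement <16 x float> %x, i32 3
  ; lanes 0 and 1 of %x227 (mask <2, 3, undef, ...>)
  %lo = fadd float %e0, %e2
  %hi = fadd float %e1, %e3
  ; lane 0 of %x229 (mask <1, undef, ...>, then fsub)
  %r = fsub float %lo, %hi
  ret float %r
}
```

Only four of the sixteen input lanes are live, and the final step is an fsub, so as written this is not a pure add reduction of the whole vector.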

https://reviews.llvm.org/D36454