[PATCH] D59710: [SLP] remove lower limit for forming reduction patterns

Fri Jun 14 07:42:41 PDT 2019

arsenm added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll:21
 ; GFX9-NEXT:    [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
-; GFX9-NEXT:    [[CMP1:%.*]] = icmp sgt i32 [[TMP2]], [[TMP3]]
-; GFX9-NEXT:    [[SELECT1:%.*]] = select i1 [[CMP1]], i32 [[TMP2]], i32 [[TMP3]]
-; GFX9-NEXT:    [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* bitcast (i32* getelementptr inbounds ([32 x i32], [32 x i32]* @arr, i64 0, i64 2) to <4 x i32>*), align 8
-; GFX9-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
-; GFX9-NEXT:    [[RDX_MINMAX_CMP:%.*]] = icmp sgt <4 x i32> [[TMP4]], [[RDX_SHUF]]
-; GFX9-NEXT:    [[RDX_MINMAX_SELECT:%.*]] = select <4 x i1> [[RDX_MINMAX_CMP]], <4 x i32> [[TMP4]], <4 x i32> [[RDX_SHUF]]
-; GFX9-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <4 x i32> [[RDX_MINMAX_SELECT]], <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
-; GFX9-NEXT:    [[RDX_MINMAX_CMP2:%.*]] = icmp sgt <4 x i32> [[RDX_MINMAX_SELECT]], [[RDX_SHUF1]]
-; GFX9-NEXT:    [[RDX_MINMAX_SELECT3:%.*]] = select <4 x i1> [[RDX_MINMAX_CMP2]], <4 x i32> [[RDX_MINMAX_SELECT]], <4 x i32> [[RDX_SHUF1]]
-; GFX9-NEXT:    [[TMP5:%.*]] = extractelement <4 x i32> [[RDX_MINMAX_SELECT3]], i32 0
-; GFX9-NEXT:    [[TMP6:%.*]] = icmp sgt i32 [[TMP5]], [[SELECT1]]
-; GFX9-NEXT:    [[OP_EXTRA:%.*]] = select i1 [[TMP6]], i32 [[TMP5]], i32 [[SELECT1]]
-; GFX9-NEXT:    [[STORE_SELECT:%.*]] = select i1 [[CMP1]], i32 3, i32 4
+; GFX9-NEXT:    [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* bitcast (i32* getelementptr inbounds ([32 x i32], [32 x i32]* @arr, i64 0, i64 2) to <2 x i32>*), align 8
+; GFX9-NEXT:    [[LOAD5:%.*]] = load i32, i32* getelementptr inbounds ([32 x i32], [32 x i32]* @arr, i64 0, i64 4), align 16
----------------
RKSimon wrote:
> ABataev wrote:
> > Check this test and a few next. Previously, we had reduction of 4 elements, now we have a reduction for 2 elements only. This patch makes it worse than it was before.
> @arsenm maybe able to confirm but AFAICT the AMDGPU changes don't appear to be relevant as for anything but i16 types it will scalarize in the backend anyhow and we're just seeing the side-effects of mostly zero costs for min/max, shuffle and extract/insert operations.
> 
> The i16 reduction tests in AMDGPU\reduction.ll are more relevant and are not affected by this patch.
GFX9 has min/max for <2 x i16>. Just about every 32-bit op is scalarized, except a few that can be treated as i64. This also shrinks the load, which is worse (but SLP for some reason usually does this, and largely why there is LoadStoreVectorizer)

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59710/new/

https://reviews.llvm.org/D59710