[PATCH] D105632: [LV] Use lookThroughAnd with logical reductions

Wed Jul 14 13:28:21 PDT 2021

dmgreen added a comment.

The test here might be OK. But what about when the type from the And mask and the reduction do not match?

Something like this, where the mask of the And is smaller than the bitwidth from the extend. The loads might also just not be extended, and the final result might not be truncated.

  define i8 @reduction_umax_trunc(i8* noalias nocapture %ptr) {
  ; CHECK-LABEL: @reduction_umax_trunc(
  ; CHECK-NEXT:  entry:
  ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
  ; CHECK:       vector.ph:
  ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
  ; CHECK:       vector.body:
  ; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
  ; CHECK-NEXT:    [[VEC_PHI:%.*]] = phi <8 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP6:%.*]], [[VECTOR_BODY]] ]
  ; CHECK-NEXT:    [[TMP0:%.*]] = and <8 x i32> [[VEC_PHI]], <i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127, i32 127>
  ; CHECK-NEXT:    [[TMP1:%.*]] = sext i32 [[INDEX]] to i64
  ; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[PTR:%.*]], i64 [[TMP1]]
  ; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <8 x i8>*
  ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <8 x i8>, <8 x i8>* [[TMP3]], align 1
  ; CHECK-NEXT:    [[TMP4:%.*]] = zext <8 x i8> [[WIDE_LOAD]] to <8 x i32>
  ; CHECK-NEXT:    [[TMP5:%.*]] = icmp ugt <8 x i32> [[TMP0]], [[TMP4]]
  ; CHECK-NEXT:    [[TMP6]] = select <8 x i1> [[TMP5]], <8 x i32> [[TMP0]], <8 x i32> [[TMP4]]
  ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
  ; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], 256
  ; CHECK-NEXT:    br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
  ; CHECK:       middle.block:
  ; CHECK-NEXT:    [[TMP8:%.*]] = call i32 @llvm.vector.reduce.umax.v8i32(<8 x i32> [[TMP6]])
  ; CHECK-NEXT:    [[EXTRACT_T:%.*]] = trunc i32 [[TMP8]] to i8
  ; CHECK-NEXT:    br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
  ; CHECK:       scalar.ph:
  ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
  ; CHECK:       for.body:
  ; CHECK-NEXT:    br i1 undef, label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP15:![0-9]+]]
  ; CHECK:       for.end:
  ; CHECK-NEXT:    [[MIN_LCSSA_OFF0:%.*]] = phi i8 [ undef, [[FOR_BODY]] ], [ [[EXTRACT_T]], [[MIDDLE_BLOCK]] ]
  ; CHECK-NEXT:    ret i8 [[MIN_LCSSA_OFF0]]
  ;
  entry:
    br label %for.body

  for.body:
    %iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
    %sum.02p = phi i32 [ %max, %for.body ], [ 0, %entry ]
    %sum.02 = and i32 %sum.02p, 127
    %gep = getelementptr inbounds i8, i8* %ptr, i32 %iv
    %load = load i8, i8* %gep
    %ext = zext i8 %load to i32
    %icmp = icmp ugt i32 %sum.02, %ext
    %max = select i1 %icmp, i32 %sum.02, i32 %ext
    %iv.next = add i32 %iv, 1
    %exitcond = icmp eq i32 %iv.next, 256
    br i1 %exitcond, label %for.end, label %for.body

  for.end:
    %ret = trunc i32 %max to i8
    ret i8 %ret
  }

The order of the max operations is no longer the same, and the And will cut off bits from the value. For example, take ptr[254] = 0x90 and ptr[255] = 0x70. The scalar code would have %max = 0x90 on the penultimate iteration, and on the final iteration %sum.02 = 0x90 & 0x7f = 0x10 and %max = umax(0x10, 0x70) = 0x70, which is the final result. The vector version will compute its final iteration with exiting %max values <.., .., 0x90, 0x70>. This then goes into the vector.reduce.umax, producing 0x90.
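
To make the mismatch concrete, here is a rough Python model of the two loops. It is only a sketch: the function names are made up for illustration, the remaining ptr[] entries are assumed to be 0, and the vector model assumes VF=8 with one running max per lane, as in the CHECK lines above.

```python
MASK = 0x7F  # models `and i32 %sum.02p, 127`
VF = 8       # vector width seen in the CHECK lines (assumed here)


def scalar_umax(data):
    """Scalar loop: mask the running max, then umax it with the loaded value."""
    acc = 0
    for x in data:
        acc = max(acc & MASK, x)
    return acc & 0xFF  # models the final `trunc i32 %max to i8`


def vector_umax(data, vf=VF):
    """Vectorized loop: each lane keeps its own masked running max, and the
    lanes are combined by a single umax reduction after the loop."""
    lanes = [0] * vf
    for i in range(0, len(data), vf):
        chunk = data[i:i + vf]
        lanes = [max(acc & MASK, x) for acc, x in zip(lanes, chunk)]
    return max(lanes) & 0xFF  # models vector.reduce.umax followed by trunc


# Hypothetical input: all zeros except the two values from the example.
data = [0] * 256
data[254] = 0x90
data[255] = 0x70

print(hex(scalar_umax(data)))  # 0x70
print(hex(vector_umax(data)))  # 0x90
```

In the scalar model the 0x90 is masked down to 0x10 before the last compare, while in the vector model it sits in its own lane and is never masked again before the reduction.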

It feels conceptually odd to me to treat max(and(x, mask), y) the same as max(x, y). That treatment makes sense for add and mul, and I think for and/or/xor too, provided it's the bottom bits that are demanded. It may be possible to add an extra And with a mask to get the same results. But I'm not sure there isn't anything else that would go wrong, considering how subtle the issues are here, and the various assumptions other code would end up making.
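
As a quick sanity check of that distinction, the following brute-force sketch tests, for 8-bit inputs and a low-bit mask of 0x7f, whether masking one operand first changes the demanded low bits of the result. The helper name and the choice of mask are assumptions for illustration, not anything taken from the patch.

```python
from itertools import product

MASK = 0x7F          # low-bit mask, i.e. only the bottom bits are demanded
VALUES = range(256)  # exhaustive over 8-bit inputs


def low_bits_unaffected(op):
    """True if op(x & MASK, y) and op(x, y) agree on the low bits for all x, y."""
    return all((op(x & MASK, y) & MASK) == (op(x, y) & MASK)
               for x, y in product(VALUES, repeat=2))


results = {
    "add": low_bits_unaffected(lambda x, y: x + y),
    "mul": low_bits_unaffected(lambda x, y: x * y),
    "and": low_bits_unaffected(lambda x, y: x & y),
    "or": low_bits_unaffected(lambda x, y: x | y),
    "xor": low_bits_unaffected(lambda x, y: x ^ y),
    "umax": low_bits_unaffected(max),
}

for name, ok in results.items():
    print(f"{name}: {ok}")
```

Only umax fails the check under these assumptions, for instance with x = 0x90 and y = 0x70, because the comparison depends on the high bits that the mask removes.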

Do the min/max cases come up in practice from C code?

================
Comment at: llvm/test/Transforms/LoopVectorize/trunc-reductions.ll:85
+  %iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
+  %sum.02p = phi i32 [ %xor, %for.body ], [ 65535, %entry ]
+  %sum.02 = and i32 %sum.02p, 65535
----------------
Can you change the start value to 0? Otherwise the loop doesn't really do much other than return 65535.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105632/new/

https://reviews.llvm.org/D105632