[llvm-dev] how experimental are the llvm.experimental.vector.reduce.* functions?

Andrew Kelley via llvm-dev llvm-dev at lists.llvm.org
Sat Feb 9 12:56:25 PST 2019


On 2/9/19 2:05 PM, Craig Topper wrote:
> Something like this should work I think.
> 
> ; ModuleID = 'test.ll'
> source_filename = "test.ll"
> 
> define void @entry(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %x) {
> Entry:
>   %tmp = load <4 x i32>, <4 x i32>* %a, align 16
>   %tmp1 = load <4 x i32>, <4 x i32>* %b, align 16
>   %tmp2 = add <4 x i32> %tmp, %tmp1
>   %tmpsign = icmp slt <4 x i32> %tmp, zeroinitializer
>   %tmp1sign = icmp slt <4 x i32> %tmp1, zeroinitializer
>   %sumsign = icmp slt <4 x i32> %tmp2, zeroinitializer
>   %signsequal = icmp eq <4 x i1> %tmpsign, %tmp1sign
>   %summismatch = icmp ne <4 x i1> %sumsign, %tmpsign
>   %overflow = and <4 x i1> %signsequal, %summismatch
>   %tmp5 = bitcast <4 x i1> %overflow to i4
>   %tmp6 = icmp ne i4 %tmp5, 0
>   br i1 %tmp6, label %OverflowFail, label %OverflowOk
> 
> OverflowFail:                                     ; preds = %Entry
>   tail call fastcc void @panic()
>   unreachable
> 
> OverflowOk:                                       ; preds = %Entry
>   store <4 x i32> %tmp2, <4 x i32>* %x, align 16
>   ret void
> }
> 
> declare fastcc void @panic()


Thanks! I was able to get it working with your hint:

>   %tmp5 = bitcast <4 x i1> %overflow to i4

(Thanks also to LebedevRI who pointed this out on IRC)
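
For what it's worth, the intrinsic from the subject line can express the
same "any lane overflowed" test without the bitcast trick. A minimal
sketch, assuming the current experimental mangling (which has shifted
between releases, so the declaration may need adjusting):

  %any = call i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1> %overflow)
  br i1 %any, label %OverflowFail, label %OverflowOk

declare i1 @llvm.experimental.vector.reduce.or.v4i1(<4 x i1>)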

Until LLVM 9, when the llvm.*.with.overflow.* intrinsics gain vector
support (a sketch of that form follows the explanation below), here's
what I ended up with:

  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  %2 = sext <4 x i32> %0 to <4 x i33>, !dbg !59
  %3 = sext <4 x i32> %1 to <4 x i33>, !dbg !59
  %4 = add <4 x i33> %2, %3, !dbg !59
  %5 = trunc <4 x i33> %4 to <4 x i32>, !dbg !59
  %6 = sext <4 x i32> %5 to <4 x i33>, !dbg !59
  %7 = icmp ne <4 x i33> %4, %6, !dbg !59
  %8 = bitcast <4 x i1> %7 to i4, !dbg !59
  %9 = icmp ne i4 %8, 0, !dbg !59
  br i1 %9, label %OverflowFail, label %OverflowOk, !dbg !59

Idea being: sign-extend the operands and do the addition with one extra
bit, truncate to get the result, then re-extend the truncated result and
check whether it matches the pre-truncation value; any lane where it
differs has overflowed.
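
Once the with.overflow intrinsics accept vector types (the LLVM 9 note
above), I'd expect all of this to collapse to a single intrinsic call.
A sketch only, assuming the vector form keeps the scalar intrinsic's
shape (a struct of the result vector plus per-lane overflow bits);
%0 and %1 are the loaded operands from the snippet above:

  ; hypothetical vector form of llvm.sadd.with.overflow
  %res = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %0, <4 x i32> %1)
  %sum = extractvalue { <4 x i32>, <4 x i1> } %res, 0
  %ov = extractvalue { <4 x i32>, <4 x i1> } %res, 1
  %ovbits = bitcast <4 x i1> %ov to i4
  %anyov = icmp ne i4 %ovbits, 0
  br i1 %anyov, label %OverflowFail, label %OverflowOk

declare { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)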

This works pretty well unless the vector integer type is as big as or
larger than the native vector register. Here's a quick performance test:

https://gist.github.com/andrewrk/b9734f9c310d8b79ec7271e7c0df4023

Summary: safety-checked integer addition with no optimizations

<4 x i32>:
scalar = 893 MiB/s
vector = 3.58 GiB/s

<16 x i128>:
scalar = 3.6 GiB/s
vector = 2.5 GiB/s

