[PATCH] D52286: [Intrinsic] Signed Saturation Intrinsic

Mon Oct 8 02:03:53 PDT 2018

ebevhan added a comment.

I've been experimenting a bit with early expansion of our sat intrinsic to see how it affects code generation. In general, there doesn't seem to be much of an effect: neither our benchmarks nor the user code I'm looking at is noticeably affected by expanding the intrinsic. This is probably because most instances of saturation are 'locked down' by surrounding intrinsics.

However, it's definitely possible to construct quite simple cases that are worsened by expansion. Here's an example. Obviously I'm the only one who can compile this, but I think the idea gets across:

  __fixed ac(__accum a, unsigned int n) {
    a = a < 0.0a ? 0.0a : a;
    __accum s = 0.0a;
    for (unsigned i = 0; i < n; i++)
      s += (__sat __fixed)a;
    return (__sat __fixed)s;
  }

If we expand the saturation intrinsic (used for the casts from `__accum` to `__sat __fixed`) right after IR emission, our final optimized IR becomes:

  define i16 @ac(i24 %a, i16 %n) #0 {
  entry:
    %0 = icmp sgt i24 %a, 0
    %cond = select i1 %0, i24 %a, i24 0
    %cmp19 = icmp eq i16 %n, 0
    br i1 %cmp19, label %.thread13, label %for.cond.cleanup

  for.cond.cleanup:                                 ; preds = %entry
    %1 = icmp slt i24 %cond, 32767
    %2 = select i1 %1, i24 %cond, i24 32767
    %3 = and i24 %2, 65535
    %4 = add i16 %n, -1
    %5 = zext i16 %4 to i24
    %6 = add nuw nsw i24 %5, 1
    %7 = mul i24 %6, %3
    %8 = icmp sgt i24 %7, -32768
    br i1 %8, label %9, label %.thread13

  ; <label>:9:                                      ; preds = %for.cond.cleanup
    %10 = icmp slt i24 %7, 32767
    %extract.t15 = trunc i24 %7 to i16
    br i1 %10, label %.thread13, label %11

  .thread13:                                        ; preds = %9, %for.cond.cleanup, %entry
    %.off014 = phi i16 [ %extract.t15, %9 ], [ 0, %entry ], [ -32768, %for.cond.cleanup ]
    br label %11

  ; <label>:11:                                     ; preds = %.thread13, %9
    %.off0 = phi i16 [ %.off014, %.thread13 ], [ 32767, %9 ]
    ret i16 %.off0
  }

For our target, this doesn't select any saturation instructions, and takes 19 static cycles.

If we keep the saturation intrinsic instead:

  define i16 @ac(i24 %a, i16 %n) {
  entry:
    %cmp17 = icmp eq i16 %n, 0
    br i1 %cmp17, label %for.cond.cleanup, label %for.body.lr.ph

  for.body.lr.ph:                                   ; preds = %entry
    %0 = icmp sgt i24 %a, 0
    %cond = select i1 %0, i24 %a, i24 0
    %1 = tail call i24 @llvm.sat.i24(i24 %cond, i32 16)
    %2 = add i16 %n, -1
    %3 = zext i16 %2 to i24
    %4 = shl i24 %1, 8
    %5 = ashr exact i24 %4, 8
    %6 = add nuw nsw i24 %3, 1
    %7 = mul i24 %6, %5
    br label %for.cond.cleanup

  for.cond.cleanup:                                 ; preds = %for.body.lr.ph, %entry
    %s.0.lcssa = phi i24 [ 0, %entry ], [ %7, %for.body.lr.ph ]
    %8 = tail call i24 @llvm.sat.i24(i24 %s.0.lcssa, i32 16)
    %resize3 = trunc i24 %8 to i16
    ret i16 %resize3
  }

This builds to 13 cycles.

I suppose you could consider this case to be a bit contrived, since the loop is eliminated, but it's still the case that the optimizer mangled the original saturation operations. Essentially, anything that causes the optimizer to move the max and min selects away from each other will have this effect.
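
For reference, the pair of selects in question can be sketched in plain C. This is only an illustration: `sat_signed` is a hypothetical helper, not part of the patch, and it assumes the intrinsic clamps its operand to the signed range of the given bit width, which is what the 32767 and -32768 constants in the expanded IR suggest for a width of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the saturation: clamp x to the signed range of a
 * `bits`-wide integer. Assumed semantics, inferred from the expanded IR. */
static int32_t sat_signed(int32_t x, unsigned bits) {
    assert(bits >= 1 && bits <= 32);

    /* Largest and smallest values representable in `bits` signed bits. */
    int32_t max = (int32_t)((UINT32_C(1) << (bits - 1)) - 1);
    int32_t min = -max - 1;

    /* First select: max(x, min). */
    int32_t lower_clamped = x < min ? min : x;

    /* Second select: min(lower_clamped, max). */
    return lower_clamped > max ? max : lower_clamped;
}
```

With a width of 16, `sat_signed(40000, 16)` gives 32767 and `sat_signed(-40000, 16)` gives -32768. As long as the two selects stay adjacent they form one recognizable clamp; in the expanded IR above they have ended up in different blocks.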

Repository:
  rL LLVM

https://reviews.llvm.org/D52286