[LLVMdev] Proposal: add intrinsics for safe division

Fri Apr 25 17:04:27 PDT 2014

On Apr 25, 2014, at 2:21 PM, Eric Christopher <echristo at gmail.com> wrote:

>> In short, I agree with your observations that these intrinsics are not an
>> obvious slam-dunk compared to making the explicit control flow, but I think
>> that the intrinsics do give enough flexibility on the LLVM side that it
>> would be great if front-ends used them rather than rolling the control flow
>> themselves.
>> 
> 
> The problem is that then we have 2 problems: All targets (except for
> arm64) then have to lower the intrinsic as the first thing they do
> (giving us a TTI pass as the first thing in the pipeline) to take
> advantage of the information later during optimization, _and_ we have
> to plumb all of the work optimizing the intrinsic as well giving us a
> situation where we've now split our optimization efforts as well as
> the pain of maintaining an intrinsic that's useful for a single
> target.
> 
> I really think that this is just solidifying my position that the
> intrinsic is a bad idea and that this should be done as later
> optimizations.

The goal of the intrinsic wasn't stated clearly.

This intrinsic isn't particularly necessary for any specific frontend
or architecture. I think the LLVM community would benefit from a
canonical way to represent well-defined integer division. We don't
need to add an IR instruction, because it's fine for most optimization
passes to ignore these operations. A target independent intrinsic
works well as long as:

- It cleanly captures the language level semantics.

- It facilitates mid-level optimization.

- It naturally lowers into ideal code both on architectures that trap
  (x86) and those that don't (arm).

To summarize my understanding of the concerns:

(1) The semantics of the safe.div intrinsic need to be useful for the
Language/ISA Matrix that LLVMers care about.

At canonical IR level, the intrinsic is useful by eliminating control
flow merges and representing divide-by-zero and/or signed overflow in
a canonical form:

    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %bit = extractvalue {i32, i1} %res, 1
    br i1 %bit, label %trap, label %continue
trap:
    call ...
    unreachable

continue:
    %div = extractvalue {i32, i1} %res, 0

The original proposal fails to achieve this because the common case of
Java/Go would require a check in the slow-path to differentiate
divide-by-zero from signed overflow. That should be fixed by
generalizing the intrinsic so that the two conditions are distinct:

    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %div0 = extractvalue {i32, i1} %res, 1
    br i1 %div0, label %trap, label %checkovf

checkovf:
    %ovf = extractvalue {i32, i1} %res, 2
    br i1 %div0, label %trap, label %continue

trap:
    call ...
    unreachable

continue:
    %div = extractvalue {i32, i1} %res, 0

...or some variation of the above. I don't have a personal stake in this.

(2) The safe.div intrinsic inhibits generic code motion and Correlated
Value Prop based optimization.

This goes both ways.

CVP could miss out on cases unless we teach it about the semantics of
the intrinsics. Correct me if I'm wrong, but I don't think this would
actually be too difficult.

OTOH, with the intrinsics, it would be easy to write a simple
optimization pass that hoists and combines checks along the domtree.

After divide-by-zero optimization, you would have something like:

    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %div0 = extractvalue {i32, i1} %res, 1
    br i1 %div0, label %trap, label %continue

trap:
    # No call here!
    unreachable

continue:
    %div = extractvalue {i32, i1} %res, 0

And the branch to trap just goes away at some point.

Now considering Reid's LICM example:

for (int i = 0; i < n; ++i) {
  if (b == 0) c[i] = 0;
  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
  else c[i] = a[i] / b;
}

Simply put, if we want to unswitch this code for x86, then we need a
TTI pass to lower the intrinsic. On the other hand, I believe it is
more important to eliminate the control flow within the loop to aid
loop analysis and other optimizations. So we still want the front end
to emit the intrinsic, we just may eventually want it lowered earlier
than CGP. I don't think this issue has any bearing on the intrinsic's
LangRef spec.

There was some comparison made to iload/istore, which I don't
follow:
- subsuming loads and stores into another instruction is really scary
  considering how much logic we have for analyzing the side effects of
  memory access.
- there is no benefit to IR optimization to the iload/istore intruction.
- it is easy to detect null checked load/stores in IR at any point.
- it is very rare that a platform would want to resume from trapping load/store.

(3) The safe.div intrinsic is a target-specific codegen optimization.

Regardless of the target, we want to eliminate control flow merges in
all cases, and completely eliminate control flow when we don't have
trapping semantics. To do this, we need an intrinsic at canonical IR
level. Clearly it should be a target independent intrinsic
since *some* optimizations/analyses want to be aware of it.

Even ignoring the rationale above, supporting codegen with a
target-specific intrinsic would minimally require the DAG builder to
match this
 "if (b != 0) ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0)"

after all optimizations have had their way with it. A big problem here
is that there are actually many different forms of this expression and
surrounding control flow, depending on the frontend's integer division
semantics. We would have to recognize all the variations of checking for
divide-by-zero and/or overflow, trapping and/or producing a certain
constant, branching or selecting values. This is obviously terrible
from an engineering standpoint.

-Andy