[cfe-dev] [llvm-dev] Writing built-ins for instructions returning multiple operands

Fri Sep 11 02:46:18 PDT 2015

Thanks for this feedback.  Although my (contrived) example used an 'int', the actual instructions use vector operands, so it’s a bit trickier, but the approach you have outlined looks workable.  I had been avoiding the notion of "pass-by-reference", but the transformations you describe should let me expose this through pointers or references in C/C++, while still lowering to the intended instruction and eliminating the implied indirection.
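
As a rough sketch of the C-level shape this might take for vector operands, assuming the GCC/Clang vector_size extension: the names v4i32 and vminmax below are hypothetical, and the lane loop only stands in for where a real target builtin would go; it is not the actual instruction.

```c
#include <assert.h>

/* Four 32-bit lanes, using the GCC/Clang vector_size extension. */
typedef int v4i32 __attribute__((vector_size(16)));

/* Hypothetical two-result operation: returns the lane-wise minimum and
   writes the lane-wise maximum through max_out.  The loop is a portable
   stand-in for a target builtin that would produce both results at once. */
static inline v4i32 vminmax(v4i32 a, v4i32 b, v4i32 *max_out) {
    assert(max_out != 0);
    v4i32 lo = a, hi = b;
    for (int i = 0; i < 4; ++i) {
        lo[i] = a[i] < b[i] ? a[i] : b[i];
        hi[i] = a[i] < b[i] ? b[i] : a[i];
    }
    *max_out = hi;
    return lo;
}
```

The second result travels through a pointer at the source level only; once the wrapper is inlined, the store and reload through max_out are the kind of indirection the optimisers would be expected to remove.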

All the best,

	MartinO

-----Original Message-----
From: Dr D. Chisnall [mailto:dc552 at hermes.cam.ac.uk] On Behalf Of David Chisnall
Sent: 09 September 2015 11:43
To: mats petersson <mats at planetcatfish.com>
Cc: Martin J. O'Riordan <martin.oriordan at movidius.com>; llvm-dev <llvm-dev at lists.llvm.org>; cfe-dev at lists.llvm.org
Subject: Re: [cfe-dev] [llvm-dev] Writing built-ins for instructions returning multiple operands

On 9 Sep 2015, at 11:31, mats petersson via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> However, if we have, say, an instruction that returns two distinct values (div that also gives the remainder, as a simple example), you will either have to return a (small) struct, or pass in a pointer to be filled in by the function [the latter is not ideal from an optimisation perspective, as the optimiser has a harder time knowing if the output is aliased with something else].

It’s important to differentiate the C builtin from the LLVM intrinsic.  It’s generally more usable (and idiomatic) in C to have additional return values passed back through pointer arguments.  It’s generally more useful in LLVM IR to have multiple return values as a struct.  For example, consider the overflow-checked builtins.

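The following C is meant to be a function that multiplies two numbers and returns either the result or 0 on overflow (note that the builtin returns true on overflow, so the comparison as written is inverted and the function returns 0 when there is no overflow; the IR below reflects the code as written):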

unsigned int mul(unsigned int x, unsigned int y) {
	unsigned int result;
	return __builtin_umul_overflow(x, y, &result) == 0 ? 0 : result;
}

This becomes some fairly complex IR, with the key part being:

  %result = alloca i32, align 4
...
  %5 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %3, i32 %4) ...
  %7 = extractvalue { i32, i1 } %5, 0
  store i32 %7, i32* %result, align 4

SROA then happily turns this entire function into:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %4 = zext i1 %2 to i32
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %6, label %7

; <label>:6                                       ; preds = %0
  br label %8

; <label>:7                                       ; preds = %0
  br label %8

; <label>:8                                       ; preds = %7, %6
  %9 = phi i32 [ 0, %6 ], [ %3, %7 ]
  ret i32 %9
}

SimplifyCFG then turns the branches into a single select:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %4 = zext i1 %2 to i32
  %5 = icmp eq i32 %4, 0
  %. = select i1 %5, i32 0, i32 %3
  ret i32 %.
}

And instcombine gets rid of the redundant zext / icmp:

define i32 @mul(i32 %x, i32 %y) #0 {
  %1 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %x, i32 %y)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  %. = select i1 %2, i32 %3, i32 0
  ret i32 %.
}

TL;DR version: Just because you expose a builtin to C as something that takes a pointer doesn’t mean that the optimisers will struggle with it if you expose a sensible LLVM IR intrinsic.
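
The same layering can be sketched in plain C for the div/rem case mentioned in the quoted text.  This is only an illustration under assumptions: udivrem_t, udivrem_pair and udivrem are made-up names, not real builtins or intrinsics, and the struct-returning function merely plays the role that an intrinsic returning an aggregate would play in IR.

```c
#include <assert.h>

/* Struct form: mirrors an intrinsic that returns an aggregate
   such as { i32, i32 }. */
typedef struct {
    unsigned quot;
    unsigned rem;
} udivrem_t;

static inline udivrem_t udivrem_pair(unsigned n, unsigned d) {
    assert(d != 0 && "division by zero");
    udivrem_t r = { n / d, n % d };
    return r;
}

/* Pointer form: the shape a C programmer would normally call.
   It is a thin wrapper over the struct form. */
static inline unsigned udivrem(unsigned n, unsigned d, unsigned *rem) {
    udivrem_t r = udivrem_pair(n, d);
    *rem = r.rem;
    return r.quot;
}
```

A caller would write something like `unsigned r; unsigned q = udivrem(17, 5, &r);` and, after inlining, the store through rem is a plain local store that the optimisers can be expected to remove.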

David