[llvm-dev] Rotates, once again

Tue May 15 15:34:51 PDT 2018

Thanks for writing this up. I'd like to have this intrinsic too.

Another argument for having the intrinsic is shown in PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426

Vectorization goes overboard because the throughput cost model used by the
vectorizers doesn't match the 6 IR instructions that correspond to 1 x86
rotate instruction. Instead, we have:

$ opt 37426prevectorize.ll -S -cost-model -analyze
...
Cost Model: Found an estimated cost of 1 for instruction:   %and = and i32
%cond, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shl = shl i32
%1, %and
Cost Model: Found an estimated cost of 1 for instruction:   %sub = sub nsw
i32 0, %cond
Cost Model: Found an estimated cost of 1 for instruction:   %and5 = and i32
%sub, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shr = lshr i32
%1, %and5
Cost Model: Found an estimated cost of 1 for instruction:   %or = or i32
%shl, %shr

The broken cost model also affects unrolling and inlining. Size costs are
overestimated for a target that has a rotate instruction.
This cost problem isn't limited to rotate patterns (it's come up in the
context of min/max/abs/fma too). But it would be simpler if we had a rotate
intrinsic, and the 6-to-1 margin is the biggest I've seen.

Another potential benefit of a generic intrinsic is that it can replace
target-specific intrinsics. PowerPC and x86 have those. ARM translates
source-level target-specific intrinsics into regular IR, so that could use
the intrinsic too.

On Mon, May 14, 2018 at 2:53 AM, Fabian Giesen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi everyone!
>
> I recently ran into some interesting issues with generation of rotate
> instructions - the details are in the bug tracker (
> https://bugs.llvm.org/show_bug.cgi?id=37387 and related bugs) for those
> interested - and it brought up the issue of rotates in the IR again. Now
> this is a proposal that has been made (and been rejected) several times,
> but I've been told that this time round we ran into issues not previously
> considered, and that I should bring the topic up on llvm-dev.
>
> I'll try to summarize the status quo first and then bring up the tricky
> cases that might tip the balance in favor of rotate intrinsics.
>
> Rotates
> =======
>
> Rotates, or circular shifts, are bit shifts where bits "dropped out" on
> one side of the operand are shifted back in on the other.
>
> They occur less frequently in real-world code than regular (non-circular)
> bit shifts, but are nevertheless directly supported in most popular ISAs.
> They are useful primitives for cryptography, non-cryptographic hashing,
> pseudo-random number generation, bit I/O, and probably other cases as well.
>
> They do not have a direct representation in C/C++, but several compilers
> support them via intrinsics (including MSVC, which Clang-CL tries to be
> compatible to), and some languages such as Rust include them explicitly.
>
> The status quo
> ==============
>
> Rotates by compile-time constant amounts are fairly easy to express in
> C-like languages by ORing together two opposite-direction shifts. For
> example, rotating a uint32_t  x left by 3 positions in C boils down to
>
>   (x << 3) | (x >> 29)
>
> which is simple enough. The backends with rotate support detect this type
> of pattern at the DAG stage and map it to a rotate instruction.
>
> This, in a nutshell, is the argument against a direct rotate in the IR:
> the existing primitives are sufficient to model it, and doing so allows
> LLVM to leverage transforms and analyses such as SimplifyDemandedBits that
> know how to deal with the more common regular shifts to work on bit rotate
> expressions as well. Adding a new primitive would mean that a whole bunch
> of optimization passes would need to know how to deal with it (or else they
> might lose on optimization opportunities).
>
> Variable-distance rotates
> =========================
>
> Things become murkier when we consider not just rotates by a compile-time
> constant, but also rotates where the amount is determined at run time.
>
> For one thing, it's not immediately obvious how to write a well-defined
> corresponding C expression: the most "natural" version (for a 32-bit left
> rotate) is
>
>   (x << k) | (x >> (32 - k))
>
> but this is undefined for k==0 due to the right-shift of a 32-bit value by
> 32. That said, both GCC and Clang will generally compile this particular
> expression into a rotate-left as of this writing, but having UB is
> unfortunate. A trickier variant that is always well-defined is
>
>   (x << (k & 31)) | (x >> (-k & 31))
>
> but this is a substantially more complicated sub-dag  than the OR of two
> shifts you would see for a compile-time constant rotate distance, and there
> are more opportunities for things to go wrong along the way. Bug 37387
> contains a few examples, the simplest of which was given by Sanjay:
>
> void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
>   for (unsigned i = 0; i < N; ++i)
>     x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop
> invariant
> }
>
> "32 - b" is loop-invariant and gets hoisted, and what's left of the
> expression doesn't match a known "rotate" pattern (since the DAG can't see
> across BB boundaries at the moment), so instead we get two shifts and an OR
> inside the loop.
>
> The counterargument here was that GlobalISel should eventually fix this
> particular problem, but nevertheless, as of today, LICM breaks recognition
> of the rotate pattern in this (very simple) loop, and there is reason to
> believe that other transforms (existing or forthcoming) might well break
> the pattern as well; the issues referenced in the linked bug illustrate
> some of the existing failure modes.
>
> This situation is frustrating for users of variable-distance rotates;
> right now, there simply is no way to write code in a source language that
> will reliably turn into rotates; you either have to watch the generated
> code closely, or just throw up your hands and use platform-dependent inline
> assembly. (It is assumed that the code in question is in a hot spot.)
>
> The proposal
> ============
>
> The idea would be to add rotate intrinsics, primarily intended to be used
> for the variable-distance case, which is otherwise fragile.
>
> Frontends can emit them directly (for source language-level intrinsics, or
> when matched from "known good" rotate patterns) and the backends can match
> them and select appropriate instructions.
>
> For the constant-distance case, the preferred form would still be the
> OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is
> known to existing optimization passes.
>
> SDB does not currently do anything with shifts by non-constant amounts
> anyway; and, for the use cases in question, a rotate that is not touched by
> other optimization passes is still preferable to a more complex expression
> that can be partially optimized, but then ends up in a form that the
> backend doesn't know how to turn back into a rotate, which is a net loss.
>
> Thoughts?
>
> -Fabian
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180515/3dc6ca1b/attachment.html>