[llvm-dev] Rotates, once again

Mon Jul 2 15:36:51 PDT 2018

On 7/2/2018 3:16 PM, Sanjay Patel wrote:
> I also agree that the per-element rotate for vectors is what we want for 
> this intrinsic.
> 
> So I have this so far:
> 
> declare i32 @llvm.catshift.i32(i32 %a, i32 %b, i32 %shift_amount)
> declare <2 x i32> @llvm.catshift.v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %shift_amount)
> 
> For scalars, @llvm.catshift concatenates %a and %b, shifts the 
> concatenated value right by the number of bits specified by 
> %shift_amount modulo the bit-width, and truncates to the original 
> bit-width.
> For vectors, that operation occurs for each element of the vector:
>     result[i] = trunc(concat(a[i], b[i]) >> shift_amount[i])
> If %a == %b, this is equivalent to a bitwise rotate right. Rotate left 
> may be implemented by subtracting the shift amount from the bit-width of 
> the scalar type or vector element type.

Or just negating, iff the shift amount is defined to be modulo and the 
machine is two's complement.
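
To make the negation point concrete, here is a minimal C sketch of the 
proposed scalar semantics for i32. The helper names (fshr32, rotr32, 
rotl32) are made up for illustration, and I'm assuming %a is the high 
half of the concatenation; the proposal text doesn't spell that out.

```c
#include <stdint.h>

/* Hypothetical model of the proposed i32 operation: concatenate a:b
   (assumption: a is the high half), shift right by the amount taken
   modulo 32, and truncate to 32 bits. */
static uint32_t fshr32(uint32_t a, uint32_t b, uint32_t shift_amount)
{
    uint64_t concat = ((uint64_t)a << 32) | b;
    return (uint32_t)(concat >> (shift_amount & 31));
}

/* With both inputs equal, the funnel shift is a rotate right. */
static uint32_t rotr32(uint32_t x, uint32_t s)
{
    return fshr32(x, x, s);
}

/* Rotate left by s is rotate right by (32 - s) mod 32. Because the
   amount is reduced modulo 32, unsigned negation gives the same value. */
static uint32_t rotl32(uint32_t x, uint32_t s)
{
    return fshr32(x, x, 0u - s);
}
```

Note that rotl32(x, 0) works without a special case: the negated 
amount is 0, so no out-of-range shift is ever performed.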

I'm a bit worried that while modulo is the Obviously Right Thing for 
rotates, the situation is less clear for general funnel shifts.

I looked over some of the ISAs I have docs at hand for:

- x86 (32b/64b variants) has SHRD/SHLD, so both right and left variants. 
Count is modulo (mod 32 for 32b instruction variants, mod 64 for 64b 
instruction variants). As of BMI2, we also get RORX (non-flag-setting 
ROR) but no ROLX.

- ARM AArch64 has EXTR, which is a right funnel shift, but shift 
distances must be literal constants. EXTR with both source registers 
equal disassembles as ROR and is often special-cased in implementations. 
(EXTR with source 1 != source 2 often has an extra cycle of latency). 
There is RORV which is right rotate by a variable (register) amount; 
there is no EXTRV.

- NVPTX has SHF 
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf) 
with both left/right shift variants and with both "clamp" (clamps shift 
count at 32) and "wrap" (shift count taken mod 32) modes.

- GCN has v_alignbit_b32 which is a right funnel shift, and it seems to 
be defined to take shift distances mod 32.

Based on that sampling, modulo behavior seems like a good choice for a 
generic IR instruction, and if you're going to pick one direction, right 
shifts are the one to use. Not sure about other ISAs.
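
For reference, the difference between the "wrap" and "clamp" modes 
mentioned above only shows up for shift counts of 32 or more. A rough C 
sketch of a 32-bit right funnel shift in both flavors (function and 
parameter names are mine, not taken from any ISA manual):

```c
#include <stdint.h>

/* "Wrap" flavor: the shift count is taken mod 32. */
static uint32_t fshr32_wrap(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t concat = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(concat >> (s & 31));
}

/* "Clamp" flavor: the shift count saturates at 32, so any count of
   32 or more returns the high word unchanged. */
static uint32_t fshr32_clamp(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t concat = ((uint64_t)hi << 32) | lo;
    uint32_t n = s > 32 ? 32 : s;
    return (uint32_t)(concat >> n);
}
```

For counts in 0..31 the two agree. At a count of exactly 32, wrap 
returns the low word and clamp returns the high word, so a generic 
modulo definition maps directly onto the wrap mode but not onto clamp.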

-Fabian