[llvm-dev] Adding new vector instructions to LLVM Sparc backend

Tue Feb 25 10:18:34 PST 2020

Hello all,

As a major degree project, I started working on adding vector instruction
to the LLVM Sparc(modify for AJIT processor) backend.

My work is to implement VADDD, VSUBD, VUMULD, VSMULD instructions.

Their instruction format is as follows:-

31-30 op (always 10)

29-25 rd

24-19 op3

18-14 rs1

13      i (always 1)

12-10      (unused)

9-7          (datatype 8->001, 16->010, 32->100)

6-5          (always 10)

4-0          (rs2)

https://llvm.org/docs/ExtendingLLVM.html suggest me to use LLVM Custom
Intrinsic to represent this VADDD operation. Is there any detail example
code for other architectures available to look at?

Am I need to define a new class in SparcInsFormat.td
<https://github.com/llvm-mirror/llvm/blob/master/lib/Target/Sparc/SparcInstrFormats.td#L106>
because these instructions can't use predefined format-3 class of other
arithmetic instructions(8-bit felid of asi changed to specify vector
datatype)?

Does the implementation of Sparc VIS
<https://github.com/llvm/llvm-project/blob/master/llvm/lib/Target/Sparc/SparcInstrVIS.td>
resemble with these instructions?

May some LLVM backend experts give me an initial idea on what steps should
I take to add these instructions?

I have gone through LLVM target-independent code generator documentation.

SPARC architecture manual and AJIT processor ISA is attached to the mail.
https://www.gaisler.com/doc/sparcv8.pdf

Thanks and Regards,
Shivam
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200225/86745946/attachment-0001.html>
-------------- next part --------------
	64-bit ISA extensions to the AJIT processor
		Madhav Desai

1. Overview
--------

The AJIT processor implements the Sparc-V8 ISA.  We propose
to extend this ISA to provide support for a native 64-bit
integer datatype.  The proposed extensions use the existing
instruction encodings to the maximum extent possible.

All proposed extensions are RegisterXRegister -> Register,Condition-codes
type instructions.  The load/store instructions are not modified.

We list the additional instructions in the subsequent sections.
In each case, only the differences in the encoding relative
to an existing Sparc-V8 instruction are provided.

2. Integer-unit extensions: Arithmetic-logic instructions
-------------------------------------------------------

These instructions provide 64-bit arithmetic/logic support
in the integer unit.  The instructions work on 64-bit register
pairs in most cases.  Register-pairs are identified by a 5-bit
even number (lowest bit must be 0).

	ADDD			
		encoding:  same as ADD, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) + rs2(pair)

	ADDDCC
		encoding:  same as ADDCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) + rs2(pair), set Z,N

	SUBD
		encoding:  same as SUB, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) - rs2(pair)

	SUBDCC
		encoding:  same as SUBCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) - rs2(pair), set Z,N

	// shifts
	SLLD
		encoding:  same as SLL, but with Instr[6:5]=2.
			if imm bit (Instr[13]) is 1, then Instr[5:0] is the shift-amount.
			else shift-amount is the lowest 5 bits of rs2. Note that rs2
			is a 32-bit register.
		rd(pair) <-  rs1(pair) << shift-amount

	SRLD
		encoding:  same as SRL, but with Instr[6:5]=2.
			if imm bit (Instr[13]) is 1, then Instr[5:0] is the shift-amount.
			else shift-amount is the lowest 5 bits of rs2. Note that rs2
			is a 32-bit register.
		rd(pair) <-  rs1(pair) >> shift-amount

	SRAD
		encoding:  same as SRA, but with Instr[6:5]=2.
			if imm bit (Instr[13]) is 1, then Instr[5:0] is the shift-amount.
			else shift-amount is the lowest 5 bits of rs2. Note that rs2
			is a 32-bit register.
		rd(pair) <-  rs1(pair) >> shift-amount (with sign extension).

	// mul/div
	UMULD
		encoding:  same as UMUL, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) * rs2(pair)
	UMULDCC
		encoding:  same as UMULCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) * rs2(pair), sets Z, Ovflow

	SMULD
		encoding:  same as SMULD, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) * rs2(pair) (signed)
	SMULDCC
		encoding:  same as SMULCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) * rs2(pair) (signed)
		sets condition codes Z,N,Ovflow

	UDIVD
		encoding:  same as UDIV, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) / rs2(pair)
			note: can generate div-by-zero trap.
	UDIVDCC
		encoding:  same as UDIVCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) / rs2(pair)
			sets condition codes Z,Ovflow
			note: can generate div-by-zero trap.
	SDIVD
		encoding:  same as SDIV, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) / rs2(pair) (signed)
	SDIVDCC
		encoding:  same as SDIVCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) / rs2(pair) (signed)
			sets condition codes Z,N,Ovflow
			note: can generate div-by-zero trap.

	// 64-bit logical.
	ORD
		encoding:  same as OR, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) | rs2(pair)
	ORDCC
		encoding:  same as ORCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) | rs2(pair), sets Z.
	ORDN
		encoding:  same as ORN, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) | (~rs2(pair))
	ORDNCC
		encoding:  same as ORNCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) | (~rs2(pair)), sets Z
		sets Z.

	XORD
		encoding:  same as XOR, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) ^ rs2(pair)
	XORDCC
		encoding:  same as XORCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) ^ rs2(pair), sets Z
		sets Z.

	XNORD
		encoding:  same as XNOR, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) ^ rs2(pair)
	XNORDCC
		encoding:  same as XNORCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) ^ rs2(pair), sets Z

	ANDD
		encoding:  same as AND, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) . rs2(pair)
	ANDDCC
		encoding:  same as ANDCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) . rs2(pair), sets Z
	ANDDN
		encoding:  same as ANDN, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd(pair) <- rs1(pair) . (~rs2(pair))
	ANDDNCC
		encoding:  same as ANDNCC, but with Instr[13]=0 (i=0), and Instr[5]=1.
		rd <- rs1 . (~rs2), sets Z

3. Integer-unit extensions: SIMD instructions
-------------------------------------------------------
	These instructions are vector instructions which work on
	two source registers (each a 64 bit register pair), and
	produce a 64-bit vector result.  The vector elements can
	be 8-bit/16-bit/32-bit.

	VADDD8, VADDD16, VADDD32
		encoding:  same as ADDD, but with Instr[13]=0 (i=0), and Instr[6:5]=2.
			   bits Instr[9:7] are a 3-bit field, which specify the data
			   type
				001   byte			(VADDD8)
				010   half-word (16-bits)	(VADDD16)
				100   word (32-bits) 		(VADDD32)
		performs a vector operation by considering the 64-bit operands as	
		a vector of objects with specified data-type.

		vadd8 rs1,rs2, rd
		vadd16
		vadd32

	VSUBD8, VSUBD16, VSUBD32
		encoding:  same as SUBD, but with Instr[13]=0 (i=0), and Instr[6:5]=2.
			   bits Instr[9:7] are a 3-bit field, which specify the data
			   type
				001   byte 			(VSUBD8)
				010   half-word (16-bits)	(VSUBD16)
				100   word (32-bits) 		(VSUBD32)
		performs a vector operation by considering the 64-bit operands as	
		a vector of objects with specified data-type.

	VUMULD8, VUMULD16, VUMULD32
		encoding:  same as UMULD, but with Instr[13]=0 (i=0), and Instr[6:5]=2.
			   bits Instr[9:7] are a 3-bit field, which specify the data
			   type
				001   byte			(VMULD8)
				010   half-word (16-bits)	(VMULD16)
				100   word (32-bits) 		(VMULD32)
		performs a vector operation by considering the 64-bit operands as	
		a vector of objects with specified data-type.
	VSMULD8, VSUMLD16, VSMULD32
		encoding:  same as SMULD, but with Instr[13]=0 (i=0), and Instr[6:5]=2.
			   bits Instr[9:7] are a 3-bit field, which specify the data
			   type
				001   byte			(VSMULD8)
				010   half-word (16-bits)	(VSMULD16)
				100   word (32-bits) 		(VSMULD32)
		performs a vector operation by considering the 64-bit operands as	
		a vector of objects with specified data-type.

4. Integer-unit extensions: SIMD instructions
-------------------------------------------------------
	These instructions are vector instructions which reduce
	a source register to a byte result. 

	// byte-reduce or
	ADDDBYTER
		op=2, op3[3:0]=0xd, op3[5:4]=0x2, contents[7:0] of rs2 specify a mask.
		encoding
		Instr[31:30] (op) = 0x2
		Instr[29:25] (rd)    32-bit register.
		Instr[24:19] (op3) = 101101
		Instr[18:14] (rs1)   lowest bit assumed 0.
		Instr[13]    (i)  = 0 (ignored)
		Instr[12:5]   (zero)
		Instr[4:0]   (rs2)   32-bit register is read.

		rd <- (rs1_7.m7 + rs1_6.m6 + rs1_5.m5 ... + rs1_0.m0)
			(The final sum will be a 13-bit number, stored
			 in the least significant bytes.  It
			 is up to software to decide which byte(s) to
			 use).

		addbyter %rs1, %rs2/imm,  rd

	// byte-reduce or
	ORDBYTER
		op=2, op3[3:0]=0xe, op3[5:4]=0x2, contents[7:0] of rs2 specify a mask.
		encoding
		Instr[31:30] (op) = 0x2
		Instr[29:25] (rd)    rd is a 32-bit register.
		Instr[24:19] (op3) = 101110
		Instr[18:14] (rs1)   lowest bit assumed 0.
		Instr[13]    (i)  = 0 (ignored)
		Instr[12:5]   (zero)
		Instr[4:0]   (rs2)   32-bit register is read.

		rd <- (rs1_7.m7 | rs1_6.m6 | rs1_5.m5 ... | rs1_0.m0)

	// byte-reduce and
	ANDDBYTER
		op=2, op3[3:0]=0xf, op3[5:4]=0x2, contents[7:0] of rs2 specify a mask.
		encoding
		Instr[31:30] (op) = 0x2
		Instr[29:25] (rd)    rd is a 32-bit register.
		Instr[24:19] (op3) = 101111
		Instr[18:14] (rs1)   lowest bit assumed 0.
		Instr[13]    (i)  = 0 (ignored)
		Instr[12:5]   (zero)
		Instr[4:0]   (rs2)   32-bit register is read.
		rd <- ( (m7 ? rs1_7 : 0xff) . (m6 ? rs1_6 : 0xff) .... (m0 ? rs1_0 : 0xff))

	// byte-reduce xor
	XORDBYTER
		op=2, op3[3:0]=0xe, op3[5:4]=0x3, contents[7:0] of rs2 specify a mask.
		encoding
		Instr[31:30] (op) = 0x2
		Instr[29:25] (rd)    rd is a 32-bit register.
		Instr[24:19] (op3) = 111110
		Instr[18:14] (rs1)   lowest bit assumed 0.
		Instr[13]    (i)  = 0 (ignored)
		Instr[12:5]   (zero)
		Instr[4:0]   (rs2)   32-bit register is read.
		rd <- (rs1_7.m7 ^ rs1_6.m6 ^ rs1_5.m5 ... ^ rs1_0.m0)

	// positions-of-zero-bytes in d-word.
	ZBYTEDPOS
		op=2, op3[3:0]=0xf, op3[5:4]=0x3, contents[7:0] of rs2/imm-value specify a mask.
		encoding
		Instr[31:30] (op) = 0x2
		Instr[29:25] (rd)    rd is a 32-bit register.
		Instr[24:19] (op3) = 111111
		Instr[18:14] (rs1)   lowest bit assumed 0.
		Instr[13]    (i)  =  if 0, use rs2, else Instr[7:0]
		Instr[12:5]  = 0  (ignored if i=0)
		Instr[4:0]   (rs2, if i=0)   
				32-bit register is read.

		rd <- [b7_zero b6_zero b5_zero b4_zero .. b0_zero]
			(if mask-bit is zero then b*_zero is zero) 

5.  Vector floating point instructions
---------------------------------------
	These are vector float operations which work
	on two single precision operand pairs to 
	produce two single precision results.

	// SIMD float ops.
	//   NaN propagated, but no traps.
	//   For each of these, rs1,rs2,rd are
	//   considered even numbers pointing to
	//   a floating point register-pair.
	//
	VFADD
		op=2, op3=0x34, opf=0x142

		vfadd %f1, %f2, %f3

	VFSUB
		op=2, op3=0x34, opf=0x146
	VFMUL
		op=2, op3=0x34, opf=0x14a
	VFDIV
		op=2, op3=0x34, opf=0x14e
	VFSQRT
		op=2, op3=0x34, opf=0x12a

6.  CSWAP insruction
---------------------------------------

	The Sparc-V8 ISA does not include a compare-and-swap (CAS) instruction
	which is very useful in achieving consensus among distributed agents
	when the number of agents is > 2.

	We introduce a CSWAP instruction in two flavours

		CSWAPD     rs1, rs2-pair/immediate, rd-pair
			op=3
			op3= 10 1111
				(rest of instruction similar to SWAP)

		CSWAPDA    rs1, rs2-pair/immediate, rd-pair, asi
			op=3
			op3= 11 1111
				(rest of instruction similar to SWAPA)

	The semantics of the instruction (the entire sequence is atomic)
		TMPVAL = mem[rs1]  (load double, lock system bus)
		if <rs2-pair/immediate> == TMPVAL 
			(store double, unlock) mem[rs1] = <rd-pair>  
			<rd-pair>  = TMPVAL
		else
			(store double, unlock) mem[rs1] = TMPVAL

	The write under else is redundant but is required in order to unlock the bus.

	Similar to SWAP, 
		- mem[rs1] is left either with its value prior to the instruction or 
			with the value in rd-pair.
		- <rd-pair> is left either with its value prior to the instruction or
			with the value in mem[rs1].
	The processor can check rd-pair after execution to confirm if the swap
	succeeded.