[Openmp-dev] Looking for contributors to target LLVM for open-source, multi-core GP-GPU-Compute Engine and RISC CPU

Jerry Harthcock via Openmp-dev openmp-dev at lists.llvm.org
Wed Jan 20 09:51:58 PST 2016

Dear LLVM and OpenMP Members,

The purpose of this communication is to bring your attention to the
availability of an open-source, multi-core GP-GPU-Compute engine, companion
RISC CPU and RISC Coarse-Grained Scheduler (CGS), all three of them
executing the same SYMPL ISA instruction-set (*see* press release below).

LLVM, including a cycle-accurate instruction-set simulator and debugger,
still needs to be targeted to support this ISA.  So if anyone would be
interested in initiating a re-targeting project, let me know, as I am sure
we can work out a horse-trade of some sort.

SYMPL GP-GPU-Compute Engine and SYMPL 32-bit RISC CPU repository:

Yours very truly,


*For Immediate Release*
*Open-Source, IEEE754-2008 Compliant, GP-GPU-Compute Engine gets 32-Bit
RISC CPU and Coarse-Grained Scheduler that Execute Same Instruction-Set*

Austin, TX--Designed for massively parallel, FPGA-accelerated, 32-bit
single-precision floating-point applications, the SYMPL ISA open-source RTL
library now includes not only the multi-core, interleaving multi-threading,
GP-GPU-Compute engine, but also now includes both a 32-bit RISC CPU and
32-bit Coarse-Grained Scheduler (CGS) that execute the same instructions as
the GP-GPU, making the CPU, GP-GPU and CGS combination the world's first
and only RISC CPU, GP-GPU and CGS to feature a homogeneous instruction-set.

Presently available for free download at the SYMPL GP-GPU-Compute Engine
repository at GitHub, the Verilog RTL library includes synthesizable Verilog
RTL source-code for SYMPL CPU, GP-GPU, CGS models comprising the SYMPL CPU,
one to sixteen GP-GPUs and one to four CGSs.  Configuring the design is
easily done at the top level of the design--just follow the instructions
located at the bottom of the “read-me” file at the SYMPL GP-GPU-Compute Engine
repository.

Also at the repository are example test cases for the single, dual, quad,
eight, and sixteen-shader (64-thread) implementations that can all be
simulated on the free version of Xilinx's Vivado FPGA development environment,
which can be downloaded at Xilinx.com.  One test case employs a combination
of SYMPL RISC CPU, one or more SYMPL GP-GPUs, and one or more SYMPL CGSs to
perform a 3D transformation of a 3D model in .stl file format and write
the transformed object back to the Vivado simulator working directory,
again in .stl file format, so the results of the 3D transform can be viewed
using any online .stl file viewer, including the one built into GitHub.
Specifically, the example 3D transform rotates, scales and translates the
object on all three axes according to the amounts specified in the
parameter list in the CPU's program memory.  Below is a .gif showing the
“before” and “after” .stl-formatted 3D object as viewed using GitHub's .stl
viewer with the “Surface Angle” display mode selected.

[image: Inline image 1]
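
For readers who want to see what the rotate, scale and translate step
amounts to per vertex, here is a minimal Python sketch.  It is not taken
from the SYMPL test case: the function names, the use of radians, and the
order of operations (rotate about x, y, then z, then scale, then
translate) are all assumptions made for illustration.

```python
import math

def rotate_xyz(v, rx, ry, rz):
    """Rotate vertex v about the x, then y, then z axis (angles in radians)."""
    x, y, z = v
    # Rotation about the x axis.
    y, z = (y * math.cos(rx) - z * math.sin(rx),
            y * math.sin(rx) + z * math.cos(rx))
    # Rotation about the y axis.
    x, z = (x * math.cos(ry) + z * math.sin(ry),
            -x * math.sin(ry) + z * math.cos(ry))
    # Rotation about the z axis.
    x, y = (x * math.cos(rz) - y * math.sin(rz),
            x * math.sin(rz) + y * math.cos(rz))
    return (x, y, z)

def transform_vertex(v, rot, scale, trans):
    """Rotate, then scale, then translate one vertex (order is assumed)."""
    r = rotate_xyz(v, *rot)
    return tuple(c * s + t for c, s, t in zip(r, scale, trans))

def transform_triangle(tri, rot, scale, trans):
    """Apply the same transform to the three vertices of a triangle."""
    return tuple(transform_vertex(v, rot, scale, trans) for v in tri)
```

Each of rot, scale and trans is a three-element tuple, one entry per axis,
playing the role of the parameter list mentioned above.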

The other example test case involves employing UC Berkeley's RISC-V
(VSCALE) CPU RTL model as the CPU in lieu of the SYMPL RISC CPU, but still
performing the same 3D transform on the same 3D model and writing the
result to the working directory, the purpose of which is to compare
performance of the UC Berkeley RISC-V CPU with the SYMPL RISC CPU, both
clocking at 100 MHz.

In both cases, the role of the CPU is merely to push the 3D transform
parameters and triangles comprising the 3D object into the SYMPL
GP-GPU-Compute engine's 64k-word data-pool and then issue a command to the
GP-GPU's dedicated Coarse-Grained Scheduler that data is available and to
perform the 3D transform according to the parameters pushed into the
data-pool, then wait for the CGS to bring its “Done” signal back high,
indicating that the CGS has completed the task.  Upon sensing that the CGS
is done, the CPU then pulls the results out of the GP-GPU data-pool and
writes them back to their original location in its own memory space,
followed by a write of a semaphore value, which signals the test-bench that
processing has completed.  The test-bench then writes the results to the
working directory in .stl binary file format.  The transformed results can
then be viewed using any online 3D .stl file viewer.
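
The push, command, wait and pull sequence described above can be modelled
in a few lines of Python.  This is a toy sketch only: the class and
function names are hypothetical, the "transform" simply doubles each word,
and nothing here reflects the actual SYMPL register map or signal timing.

```python
class ToyComputeEngine:
    """Stand-in for the data-pool plus its CGS; all names are hypothetical."""

    def __init__(self, pool_words=65536):
        self.data_pool = [0.0] * pool_words   # 64k-word data-pool
        self.done = True                      # "Done" is high when idle
        self._pending = 0

    def command(self, n_words):
        """CPU tells the CGS that n_words of data are ready to process."""
        self.done = False
        self._pending = n_words

    def step(self):
        """Advance the model; the placeholder work doubles each word."""
        if not self.done:
            for i in range(self._pending):
                self.data_pool[i] *= 2.0
            self.done = True                  # bring "Done" back high


def run_cpu(engine, cpu_memory):
    """Follow the CPU's role: push, command, wait, pull, signal."""
    n = len(cpu_memory)
    engine.data_pool[:n] = cpu_memory         # push parameters and triangles
    engine.command(n)                         # issue the command to the CGS
    polls = 0
    while not engine.done:                    # wait for "Done" to go high
        engine.step()
        polls += 1
    cpu_memory[:] = engine.data_pool[:n]      # pull results to original location
    return "DONE_SEMAPHORE", polls            # semaphore seen by the test-bench
```

In the real design the CGS and shaders run concurrently with the CPU; the
step() call inside the polling loop is only a way to let a single-threaded
model make progress.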

Also in both cases, the role of the CGS is simply to distribute the
workload as evenly as possible among the available GP-GPU shader threads.
There are four interleaving threads per shader.  Thus, for a configuration
comprising four shaders, the maximum number of threads available to perform
the work is sixteen.  For sixteen shaders the maximum number of available
threads is sixty-four, and so on.  Each group of four shaders (sixteen
threads) has one CGS dedicated to it.  Most of the time, when a given CGS is not
actually pushing parameters and data into a given shader's parameter-data
buffer, it is polling its command register waiting for a command from the
CPU.  When it does receive a command, it then makes a determination as to
how many of its GP-GPU threads are available.  It does this by simply
counting how many of its shaders' “Done” lines are asserted active high.
Then it's just a simple calculation: divide the number of triangles by the
number of available threads, such that the result of the divide is how many
triangles (plus a portion of any remainder from the divide operation) get
pushed into a given thread's parameter-data buffer.
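
As a rough illustration of that calculation, the following Python sketch
counts the asserted "Done" lines and splits the triangles among the
available threads.  How the CGS actually hands out the remainder is not
spelled out above, so giving one extra triangle to the first few threads
is an assumption, as is the function name.

```python
def split_triangles(n_triangles, done_lines):
    """Return {thread_index: triangle_count} for threads whose Done is high."""
    # A thread is available when its "Done" line is asserted.
    available = [i for i, done in enumerate(done_lines) if done]
    if not available:
        return {}
    base, remainder = divmod(n_triangles, len(available))
    # Assumed policy: the first `remainder` threads take one extra triangle.
    return {thread: base + (1 if k < remainder else 0)
            for k, thread in enumerate(available)}
```

For the 323-triangle model and sixteen available threads this gives three
threads 21 triangles each and the other thirteen 20 each.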

How does the UC Berkeley RISC-V (RV32I) RTL model compare to the SYMPL RISC
CPU, in terms of time required to push the same number of triangles, when
both are clocking at 100 MHz?  The RISC-V requires roughly 80 usec and the
SYMPL RISC CPU roughly 20 usec.  At first glance this seems hard to
believe, but it is absolutely true.  If anyone would like to see for themselves, both test
cases are presently available for download at the SYMPL GP-GPU repository
at GitHub—just read the “read-me” file at GitHub for instructions on how to
set up the simulation.  Also at the repository are the original assembly
language source files and assembled object files, so you can review the
code yourself to make sure everyone is playing fair.

So, why the big difference?  The answer is simple:  the SYMPL ISA is based
on an enhanced Harvard, dual-operand “mover” architecture and not on the
outdated “load-store” model we were all taught in college.  The classic
load-store RISC model necessarily requires that each operand to a
computation first be loaded into one of the load-store CPU's internal
register-file register locations before a computation involving such
operands can be carried out, the results of such computation also being
stored in said register file, before it can be written out to memory, with
each step requiring at least one clock each.

The SYMPL ISA is very different in a number of very important respects.
Firstly, it per se has no register file, in that everything, including
program counter, is memory-mapped into the same data space as data memory,
such that the status register, program counter, indirect pointers
(AR0-AR3), stack pointer, etc., are essentially treated the same way as
memory.  Thus, one way to look at the SYMPL RISC model is, the entire
memory space “is” the register file and the data is already “loaded” into
it and available for computation as an operand.  Thus, no cycles need be
wasted loading a register file before a computation can be carried out.
This is especially true in the modern era, particularly as pertains to
modern FPGAs, where there are now literally megabytes of closely-coupled
memory on-chip, wherein much of the available memory is never used.
Consequently, as pertains to the newer FPGAs, designed for massively
parallel applications, there is no need for a per se register file to load
and hold operands before a computation can be carried out.
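
A back-of-the-envelope cycle count makes the argument concrete.  The
Python sketch below is an idealized lower-bound model, not a measurement:
it assumes a load-store machine spends one clock on each operand load, one
on the computation and one on the store, and that a memory-operand machine
completes one operation per clock.  Real pipelined cores overlap some of
these steps, so the function names and the numbers are illustrative only.

```python
def load_store_cycles(n_ops, loads_per_op=2):
    """Minimum clocks when every operand is loaded and every result stored."""
    # loads + one compute clock + one store clock, per operation
    return n_ops * (loads_per_op + 1 + 1)

def memory_operand_cycles(n_ops, pipeline_fill=0):
    """Clocks when operands are read from memory as part of the operation."""
    return n_ops + pipeline_fill
```

Under these assumptions a two-operand computation costs four clocks on the
load-store model and one on the memory-operand model, a ratio of the same
order as the 80 usec versus 20 usec figures quoted earlier.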

Secondly, unlike the classical load-store model that loads a single operand
at a time into its register file (at least one clock per operand), the
SYMPL ISA “mover” architecture reads two operands simultaneously, performs
the computation, and writes a result from the preceding computation back to
memory, all in one clock cycle.  To enable this capability, the SYMPL ISA
requires tri-ported SRAM having two independent read-side address/data
ports and one independent write-side port.  With today's larger FPGAs, such
as Xilinx Kintex 7, UltraScale and UltraScale+ FPGAs and Altera's Stratix-V
and Arria 10 FPGAs, this is not a problem because these devices have
megabytes of SRAM, both distributed in the fabric and in block form, which
is way more than anyone can reasonably use for most applications.  Building
a tri-port memory is easy.  Just sandwich two block SRAMs together and
connect the write-sides of each together.  The RTL in the SYMPL ISA library
shows how to do this.
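
For anyone who prefers to see the idea before opening the RTL, here is a
behavioural Python model of the "sandwich": two banks that always receive
the same write, each serving one read port.  The class name and the
read-before-write behaviour within a clock are assumptions; consult the
library's Verilog for the real thing.

```python
class TriPortMemory:
    """Two-read, one-write memory built from two mirrored banks."""

    def __init__(self, depth):
        self.bank_a = [0] * depth   # serves read port A
        self.bank_b = [0] * depth   # serves read port B

    def clock(self, rd_addr_a, rd_addr_b, wr_addr=None, wr_data=None):
        """One clock: two independent reads plus an optional write."""
        # Reads return the contents from before this clock's write (assumed).
        out_a = self.bank_a[rd_addr_a]
        out_b = self.bank_b[rd_addr_b]
        if wr_addr is not None:
            # The write sides are tied together, so both banks stay identical.
            self.bank_a[wr_addr] = wr_data
            self.bank_b[wr_addr] = wr_data
        return out_a, out_b
```

The cost of the trick is that the storage is duplicated, which is why it
suits devices with block RAM to spare.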

Thirdly, the SYMPL ISA has features absent from the RISC-V ISA, which
enable it to continuously read dual-operands, perform a computation between
them and write a result out every clock cycle without using unrolled loops
(which can consume lots of program memory) and without leaving gaping holes
in the instruction pipeline or individual floating-point operator
pipelines.  Chief among these features are four auxiliary registers that
function as indirect pointers and which have auto-post-modification
capability, meaning that these indirect pointers can be configured to
automatically post-increment, post-decrement, or remain unchanged after
each clock when used as a pointer.  When used in combination with the SYMPL
ISA RPT (“repeat”) instruction, the SYMPL RISC CPU can not only move data
around faster than a DMA channel, but it can also perform the same
computation on large blocks of data using just two instructions (RPT n
followed by the desired instruction, such as MOV, ADD, MUL, etc.), yielding
a result every clock cycle.
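
The effect of RPT combined with auto-post-modifying pointers can be
imitated in Python as below.  This is a sketch of the behaviour described
above, not of the SYMPL encoding: the Pointer class and repeat function
are hypothetical, and it assumes "RPT n" means the following instruction
runs n times, which the text does not state.

```python
import operator

class Pointer:
    """Indirect pointer with automatic post-modification."""

    def __init__(self, addr, step=0):
        self.addr = addr
        self.step = step   # +1 post-increment, -1 post-decrement, 0 unchanged

    def use(self):
        """Return the current address, then apply the post-modification."""
        addr = self.addr
        self.addr += self.step
        return addr

def repeat(n, op, memory, dst, src_a, src_b):
    """Model of 'RPT n' followed by one dual-operand instruction."""
    for _ in range(n):
        a = memory[src_a.use()]          # first operand read
        b = memory[src_b.use()]          # second operand read
        memory[dst.use()] = op(a, b)     # result written back to memory

# Example: add two four-element blocks element by element.
memory = [0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0]
repeat(4, operator.add, memory, Pointer(8, 1), Pointer(0, 1), Pointer(4, 1))
```

After the call, memory[8:12] holds the four sums, with no loop counter or
pointer arithmetic appearing in the "program" itself.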

Just like the SYMPL GP-GPU, the SYMPL RISC CPU also has a complete
repertoire of IEEE754-2008 compliant, 32-bit, memory-mapped,
single-precision floating-point operators, including FADD, FSUB, FMUL,
FDIV, FMA, DOT, SQRT, LOG, EXP, ITOF and FTOI.  The floating-point
operators presently employed in both the SYMPL CPU and SYMPL GP-GPU were
generated using FloPoCo's floating-point generator.  Because
FloPoCo-generated floating-point operators, by themselves, are not
IEEE754-2008 compliant, additional logic was added to bring them into
conformance.  Namely, additional logic was added to enable “on-the-fly”
directed rounding, quiet NaN production with diagnostic payload for invalid
operation exceptions, capture registers with encoded diagnostics for
divide-by-zero and overflow exceptions to name a few.  Since
FloPoCo-generated operators flush subnormals to zero, the operator logic
was slightly modified to disable the flush, allowing results to
underflow—gradually—pursuant to the IEEE754-2008 specification.
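
Two of the points above, gradual underflow and quiet NaNs with a payload,
can be shown with single-precision bit patterns in Python.  The sketch
uses only the standard struct module.  The helper names are made up for
this example, and the payload value is arbitrary; the actual diagnostic
encoding used by the SYMPL operators is not given in this message.

```python
import struct

FLT_MIN_NORMAL = 2.0 ** -126   # smallest positive normal binary32 value

def to_binary32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def flush_to_zero(x):
    """Model of an operator that flushes subnormal results to zero."""
    y = to_binary32(x)
    return 0.0 if 0.0 < abs(y) < FLT_MIN_NORMAL else y

def quiet_nan_bits(payload):
    """binary32 bit pattern of a quiet NaN carrying a 22-bit payload."""
    if not 0 <= payload < (1 << 22):
        raise ValueError("payload must fit in 22 bits")
    # Exponent all ones, quiet bit (top fraction bit) set, payload below it.
    return 0x7FC00000 | payload

def is_quiet_nan(bits):
    """True if the exponent is all ones and the quiet bit is set."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x00400000) != 0

half_min = FLT_MIN_NORMAL / 2          # below the normal range
gradual = to_binary32(half_min)        # kept as a subnormal
flushed = flush_to_zero(half_min)      # lost when subnormals are flushed
```

With gradual underflow the halved value survives as a subnormal, whereas
the flushing model returns zero for the same input.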

To help prevent stalls while floating-point operations are underway, each
operator has associated with it sixteen randomly addressable result
buffers that are thirty-five bits wide.  The three bits beyond the 32-bit
result are encoded to reflect which, if any, floating-point exception occurred during
computation and can be used to programmatically trigger alternate delayed
exception handling the instant the result is read from its result buffer if
an exception occurred during its computation. The results of a given
operation are automatically binned-out to the memory-mapped result buffer
corresponding to same memory-mapped address the input operands were
originally written to.  Because the floating-point operator pipelines (which
vary from two to eleven clocks deep) are decoupled from the processor's
main instruction pipeline, the CPU and/or GP-GPU can fill a given
operator's pipe in rapid succession.  By the time the CPU or GP-GPU has
written the last of the operands, the first result is already available for
reading from its respective result buffer, with the results for the
operands written after it following in turn.
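
To make the result-buffer layout easier to picture, here is a small Python
model of a 35-bit entry and of the sixteen-slot addressing.  Placing the
three exception bits above the 32-bit result, the exception code values,
and all names are assumptions for illustration, and pipeline latency is
not modelled at all.

```python
def pack_result(value_bits, exc_code):
    """Combine a 32-bit result and a 3-bit exception code into 35 bits."""
    return ((exc_code & 0x7) << 32) | (value_bits & 0xFFFFFFFF)

def unpack_result(word):
    """Split a 35-bit entry back into (value_bits, exc_code)."""
    return word & 0xFFFFFFFF, (word >> 32) & 0x7

class OperatorModel:
    """One operator with sixteen randomly addressable result buffers."""

    SLOTS = 16

    def __init__(self, func):
        # func(a_bits, b_bits) -> (value_bits, exc_code); supplied by caller.
        self.func = func
        self.results = [0] * self.SLOTS

    def write_operands(self, slot, a_bits, b_bits):
        """Operands written to slot k produce a result in result buffer k."""
        value_bits, exc_code = self.func(a_bits, b_bits)
        self.results[slot % self.SLOTS] = pack_result(value_bits, exc_code)

    def read_result(self, slot):
        """Reading returns the value together with its exception code."""
        return unpack_result(self.results[slot % self.SLOTS])
```

A program reading a result can test the returned code and branch to a
handler only when it is nonzero, which is the delayed exception handling
idea described above.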

Like the SYMPL GP-GPU floating-point operators, the SYMPL RISC CPU
floating-point operators can accept (and the SYMPL RISC CPU has the ability
to deliver) two new floating-point operands every clock cycle, especially
when a RPT instruction is employed in combination with the dual-operand MOV
instruction used to simultaneously write the two operands to the operator's
inputs.  As a result, the sixteen-shader version of the SYMPL CPU GP-GPU
combination can execute roughly 2.1 billion floating-point operations per
second when implemented in a Kintex 7 device clocking in the vicinity of
125 MHz.  To put this into perspective, clocking at 100 MHz, the SYMPL
single-shader GP-GPU can perform a 323-triangle, 3D transformation on all
three axes, including rotate, scale, and translate, in roughly 225 usec.
In comparison, the sixteen-shader version can have results ready within
just 8 usec after the last input triangle is pushed into the last GP-GPU's
data-pool for processing.
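
The figures above are easier to compare once converted to clock cycles.
The short Python calculation below does only that arithmetic on the
numbers quoted in this release; the function names are hypothetical and
no new measurements are implied.

```python
def ops_per_clock(ops_per_second, clock_hz):
    """Average floating-point operations completed per clock."""
    return ops_per_second / clock_hz

def cycles(duration_usec, clock_mhz):
    """Clock cycles elapsed: microseconds times megahertz."""
    return duration_usec * clock_mhz

# Sixteen-shader configuration: 2.1 billion operations/s at 125 MHz.
per_clock = ops_per_clock(2.1e9, 125e6)

# Single-shader configuration: 323 triangles in 225 usec at 100 MHz.
single_shader_cycles = cycles(225, 100)
cycles_per_triangle = single_shader_cycles / 323
```

This works out to about 16.8 operations per clock for the sixteen-shader
configuration, and 22,500 clocks (just under 70 per triangle) for the
single-shader transform.  One reading, which is an interpretation rather
than something stated here, is that 16.8 corresponds to roughly one
operation per clock from each of the sixteen shaders plus the CPU.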

Finally, also now included in the SYMPL ISA RTL library is the new SYMPL
Intermediate Language (IL) that can be used in lieu of, or in addition to,
SYMPL assembly language for writing SYMPL threads and programs.  SYMPL-IL
is very similar to a primitive form of the BASIC language.  For example,
instead of using the assembly language mnemonic for testing a bit, you can
now use a literal “IF” <condition> GOTO <destination>.  Another example is
the “FOR....NEXT” loop.  This new IL makes resulting code much easier to
read and understand than straight assembly, yet yields identical object
code produced by the same assembler.  The SYMPL ISA instruction table for
both the SYMPL assembler and SYMPL-IL is included with the library at the
GitHub repository at the following link:


*About SYMPL:  The Why*
The SYMPL GP-GPU-Compute project began in 2014 to address the lack of an
open-source GP-GPU accelerator so that anyone who wants to experiment with
their own home-brew or college-brew CPU can easily put it on steroids just
to see what it can do.  It is hoped that academia and industry will see the
merits in the open-source FPGA-accelerated GP-GPU-Compute concept and
collaborate to port LLVM and/or GCC to support the SYMPL ISA.  In addition,
SYMPL still needs a cycle-accurate instruction-set simulator and debugger
that can work seamlessly with Eclipse Integrated Development Environment.
Hopefully, someday soon, SYMPL will be running Android and/or iOS
applications in an FPGA system designed by a bunch of college students or a
guy in his garage.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: olive_trans_both.gif
Type: image/gif
Size: 1137460 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-dev/attachments/20160120/42358044/attachment-0001.gif>