Porting the Intel(r) OpenMP* run time library
-- Jim Cownie 1 March 2014

This short document covers some of the areas that need thought when
porting the runtime. It is split into two sections:

1) Porting to a different operating system
2) Porting to a different architecture

It is currently more a pointer to places that you need to think about than a
recipe for porting.

Porting to a different operating system

The runtime has already been ported to Linux, Mac OS X and Microsoft
Windows. You should start from the version that is nearest to the OS
you're porting to.

Threads: On Unix machines the runtime uses pthreads. If you can use
them, that should avoid the need for any change.

Affinity: For a first port I'd just cut it all out. When you want it,
you'll need system calls to discover and set thread affinity in the OS.
On Linux the sched_{get,set}affinity calls are used.
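
As a rough illustration of what "discover and set" looks like on Linux,
here is a minimal sketch that reads the calling thread's affinity mask
and then pins the thread to the first CPU it is allowed to run on. The
helper name pin_self_to_first_allowed_cpu is made up for this note and
is not part of the runtime; a new OS would need its own equivalents of
the two sched_* calls.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc exposes sched_{get,set}affinity and CPU_* only with this */
#endif
#include <sched.h>

/* Hypothetical helper (not a runtime function): pin the calling thread to
 * the lowest-numbered CPU in its current affinity mask.
 * Returns that CPU number, or -1 on error. */
int pin_self_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);

    /* A pid of 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;   /* empty mask: not expected in practice */
}
```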

Locks: The runtime has way too many lock implementations :-). However,
since most don't rely on OS-specific features (the futex lock is an
exception), they should just work.




 

Porting to a new architecture

Have a look at the various places where assembler code is
used. Clearly these will need to be fixed. However, assuming that
you're using LLVM or gcc rather than the Intel compiler to compile the
user code, you don't need to implement all the complexity of packing
multiple arguments to an outlined routine and then restoring them on
the stack of the outlined routine. Both clang/LLVM and gcc generate
outlined functions that have a single argument to point to the shared
variables in the parent stack frame. Knowing this, you can replace the
complicated code that handles variable numbers of arguments with
simpler code (all in C) that handles the one-argument case.
(We may optimise this path in the future.)
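
To give a feel for how small the one-argument case can be, here is a
sketch in plain C. Everything in it is hypothetical: the type
microtask1_t, the function invoke_microtask_one_arg and the demo body
are names invented for this note, and passing the two thread ids
alongside the shared pointer is an assumption of the sketch, not a
statement about what any particular compiler emits.

```c
#include <stddef.h>

/* Hypothetical signature for an outlined parallel-region body: two thread
 * ids plus ONE pointer to the shared variables in the parent's frame. */
typedef void (*microtask1_t)(int *gtid, int *tid, void *shared);

/* Hypothetical invoker: handles only the one-argument case, in plain C.
 * Returns 1 if the body was called, 0 if argc is not supported. */
int invoke_microtask_one_arg(microtask1_t fn, int gtid, int tid,
                             int argc, void *argv[])
{
    if (argc != 1)
        return 0;               /* no variable-argument packing here */
    fn(&gtid, &tid, argv[0]);
    return 1;
}

/* Demo only: the kind of structure a compiler might build to describe the
 * shared variables, and a body that uses it. */
struct demo_shared {
    int *sum;   /* points at a variable in the parent frame */
    int  n;
};

void demo_body(int *gtid, int *tid, void *shared)
{
    struct demo_shared *s = shared;
    (void)gtid;                 /* unused in this demo */
    *s->sum += s->n + *tid;
}
```

The point of the design is that no assembler is needed: with exactly one
pointer to forward, an ordinary C call does the job.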

The runtime uses a variety of atomic operations; you'll need to work
out how to implement them. They're used to implement locks and to
implement atomic operations for the compiler (#pragma omp atomic). A
"compare and swap" (CAS) is particularly useful here if you have
one. Otherwise you may have to use locking to guard the operation;
however, that is normally significantly slower than a single atomic
operation or a CAS.
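
The two options above can be sketched as follows. This uses C11
<stdatomic.h> as a stand-in for whatever CAS primitive your target
provides, and a pthread mutex for the fallback; the function names
atomic_mul_cas and locked_mul are invented for this note and are not the
runtime's own entry points.

```c
#include <stdatomic.h>
#include <pthread.h>

/* CAS loop: atomically perform *target *= factor and return the old value.
 * An operation like this has no single hardware instruction on most
 * machines, so it is built from compare-and-swap. */
int atomic_mul_cas(_Atomic int *target, int factor)
{
    int old = atomic_load(target);
    /* On failure the call stores the current value into 'old', so the
     * loop simply retries with a freshly computed product. */
    while (!atomic_compare_exchange_weak(target, &old, old * factor))
        ;
    return old;
}

/* Fallback when no CAS is available: guard the update with a lock.
 * Every updater must take the same lock, which is why this is slower. */
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

int locked_mul(int *target, int factor)
{
    pthread_mutex_lock(&fallback_lock);
    int old = *target;
    *target = old * factor;
    pthread_mutex_unlock(&fallback_lock);
    return old;
}
```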

Cache line size: various structures are cache-line aligned. This
improves performance. You don't need to do anything to get the code to
run, but you may want to ensure that you have the correct cache-line
size and, for tuning, think about how fields fit into the lines you have.
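
As a small illustration of what cache-line alignment buys you, the
sketch below pads a counter out to a full line so that neighbouring
array elements never share one. The 64-byte value and the names
DEMO_CACHE_LINE and padded_counter are assumptions made for this note;
check the real line size of your target.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed line size for this sketch only; many machines use 64 bytes,
 * but some use 128 or other values. */
#define DEMO_CACHE_LINE 64

/* Aligning the member forces the whole struct to line alignment, and the
 * compiler pads its size up to a multiple of that alignment. */
struct padded_counter {
    alignas(DEMO_CACHE_LINE) long value;
};

/* Compile-time check that each counter occupies exactly one line. */
_Static_assert(sizeof(struct padded_counter) == DEMO_CACHE_LINE,
               "padded_counter should fill exactly one cache line");
```

With this layout, two threads updating adjacent elements of an array of
padded_counter do not contend for the same line (no false sharing), at
the cost of extra memory per element.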

Affinity: You'll need a way to determine the machine topology, i.e. which
hardware threads are "near" each other in the sense that they share
caches. On x86, cpuid is used to find this information; on Linux,
/proc/cpuinfo can be used instead.
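
For the /proc/cpuinfo route, a minimal sketch of the idea is shown
below. It assumes the "processor", "physical id" and "core id" fields
that x86 Linux kernels print; other architectures may omit some of them,
in which case the entries stay at -1. The struct and function names are
invented for this note and are not the runtime's own parser.

```c
#include <stdio.h>

/* Hypothetical record for one hardware thread. -1 means "not reported". */
struct hw_thread {
    int os_id;     /* "processor" field   */
    int package;   /* "physical id" field */
    int core;      /* "core id" field     */
};

/* Parse /proc/cpuinfo-style text from 'in' into 'out' (at most 'max'
 * entries). Returns the number of hardware threads found. */
int parse_cpuinfo(FILE *in, struct hw_thread *out, int max)
{
    char line[2048];
    int n = 0;
    int value;

    while (fgets(line, sizeof line, in)) {
        /* Whitespace in a scanf format matches any run of blanks/tabs. */
        if (sscanf(line, "processor : %d", &value) == 1) {
            if (n == max)
                break;
            out[n].os_id = value;
            out[n].package = -1;
            out[n].core = -1;
            ++n;
        } else if (n > 0 && sscanf(line, "physical id : %d", &value) == 1) {
            out[n - 1].package = value;
        } else if (n > 0 && sscanf(line, "core id : %d", &value) == 1) {
            out[n - 1].core = value;
        }
    }
    return n;
}

/* Two hardware threads are "near" in the strongest sense if they sit on
 * the same core of the same package. Unknown ids never match. */
int share_core(const struct hw_thread *a, const struct hw_thread *b)
{
    return a->core >= 0 && a->package >= 0 &&
           a->package == b->package && a->core == b->core;
}
```

In real use you would pass the result of fopen("/proc/cpuinfo", "r") as
'in'. Note that this only tells you about shared cores, not about which
cache levels are shared between cores.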