Porting the Intel(r) OpenMP* run time library
-- Jim Cownie, 1 March 2014

This short document covers some of the areas that need thought when
porting the runtime. It is split into two sections:

1) Porting to a different operating system
2) Porting to a different architecture

It is currently more a pointer to places that you need to think about
than a recipe for porting.

Porting to a different operating system

The runtime has already been ported to Linux, Mac OSX and Microsoft
Windows. You should start from the version that is nearest to the OS
you're porting to.

Threads: On Unix machines the runtime uses pthreads. If you can use them,
that should avoid any need for change.

Affinity: For a first port I'd just cut it all out. When you want it,
you'll need system calls to discover thread affinity and to assert it to
the OS. On Linux the sched_{get,set}affinity calls are used.

Locks: The runtime has way too many lock implementations :-). (Though
since most don't rely on OS-specific features [the futex lock is an
exception], they should just work.)

Porting to a new architecture

Have a look at the various places where assembler code is used. Clearly
these will need to be fixed. However (assuming that you're not using the
Intel compiler, but LLVM or gcc, to compile the user code), you don't need
to implement all the complexity of packing multiple arguments to an
outlined routine and then restoring them in the stack of the outlined
routine. Both clang/LLVM and gcc generate outlined functions that take a
single argument, which points to the shared variables in the parent stack
frame. Knowing this, you can replace the complicated code that handles
variable numbers of arguments with simpler code (all in C) that handles
the one-argument case. (We may optimise this path in the future.)

The runtime uses a variety of atomic operations; you'll need to work out
how to implement them. They're used to implement locks and to implement
atomic operations for the compiler (#pragma omp atomic).
A "compare and swap" (CAS) is particularly useful here if you have one.
Otherwise you may have to use locking to guard the operation; however,
that is normally significantly slower than a single atomic operation or
a CAS.

Cache line size: Various structures are cache-line aligned. This improves
performance. You don't need to do anything to get the code to run, but
for tuning you may want to ensure that you have the correct cache-line
size and think about how fields fit into the lines you have.

Affinity: You'll need a way to determine the machine topology, i.e. which
hardware threads are "near" each other in the sense that they share
caches. On x86, cpuid is used to find this information, or /proc/cpuinfo
can be used on Linux.
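
As an illustration of the atomic operations point above, here is a minimal
sketch of an update that has no single hardware instruction ("x *= factor"),
built first from a CAS loop and then from a lock as the fallback. It assumes
a C11 compiler with <stdatomic.h> and pthreads. The names atomic_mul_cas,
locked_mul and fallback_lock are hypothetical and are not taken from the
runtime sources; the runtime's own atomics may be organised quite
differently.

```c
#include <stdatomic.h>
#include <pthread.h>

/* Hypothetical helper (not from the runtime): atomic "x *= factor"
 * built from a compare-and-swap loop. Returns the new value. */
static int atomic_mul_cas(_Atomic int *target, int factor)
{
    int expected = atomic_load(target);

    /* If *target no longer equals 'expected', the CAS fails and writes
     * the current value back into 'expected', so the product is
     * recomputed from fresh data on every retry. */
    while (!atomic_compare_exchange_weak(target, &expected,
                                         expected * factor)) {
        /* another thread got in first (or a spurious failure): retry */
    }
    return expected * factor;
}

/* Fallback for a target with no usable CAS: guard the update with a
 * lock. Correct, but normally much slower under contention. */
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

static int locked_mul(int *target, int factor)
{
    int result;

    pthread_mutex_lock(&fallback_lock);
    *target *= factor;
    result = *target;
    pthread_mutex_unlock(&fallback_lock);
    return result;
}
```

Both helpers give the same result when called from a single thread; the
difference only shows up under contention, where the CAS version avoids
taking a lock. A single global lock is used here purely to keep the sketch
short.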