[Openmp-dev] Is libomp open to add arch-specific barrier implementation?

misono.tomohiro@fujitsu.com via Openmp-dev openmp-dev at lists.llvm.org
Mon Nov 1 19:12:21 PDT 2021


Hello, list.

I'm Tomohiro Misono, a software engineer at Fujitsu.

I'm new to the libomp community, and today I'd like to ask a question about libomp's development policy.
In short, as the title says: is libomp open to adding an arch (CPU)-specific barrier implementation?
I have the hardware-assisted barrier of the Fujitsu A64FX in mind.

A more detailed background follows.

The A64FX processor[*] (designed for HPC and used in the supercomputer Fugaku) has a hardware-assisted
barrier that uses architecture-specific registers. This mechanism can synchronize threads
within an L2-share domain. Although Fujitsu has its own OpenMP runtime library
implementation that supports this barrier, we are now considering whether it can also be supported in an
open library (i.e. libomp). Based on my research, I think the barrier could be supported in
libomp by adding a new barrier type that works only on a specific architecture. Is this approach
acceptable to the community?

[*] Specifications: https://github.com/fujitsu/A64FX 

Note that the code we have at this point cannot easily be incorporated into libomp, so new
development from scratch is required. Also, a kernel driver must be loaded to access the
registers (please see below). Before starting development, I just want to know whether this plan
is feasible in the first place.

Some notes on a possible implementation:
 - A64FX's hardware barrier can synchronize only within an L2-share domain. Therefore, a
  conventional software barrier (i.e. the flag class) is still needed for cross-L2-domain
  synchronization. A possible implementation would thus resemble the hierarchical barrier, with
  only the leaves using the hardware barrier. I think extending the current hierarchical barrier
  code would get messy, so introducing a new barrier type is better.
 - In the optimal case (i.e. a barrier within a single L2 domain), there is no need to use a
  software barrier at all. Currently, task execution is mainly coupled with the flag class, and
  this needs to be addressed somehow.
 - To use the hardware barrier, each thread must be bound to a specific core and cannot be
  moved. If this condition is not met, the library has to fall back to the software barrier.
  I think this restriction implies that the hardware barrier cannot be used at fork_barrier.
 - Last but not least: to access the barrier registers on A64FX, a Linux kernel driver is needed.
  We are willing to open the driver code too (but it has not been accepted by the Linux kernel
  community at this point). The ultimate goal is to define a user-kernel interface that is as
  general as possible, so that the code can be reused for both libomp and the kernel driver if
  other hardware-assisted barrier implementations emerge, but this is a challenging problem.
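
To make the notes above more concrete, here is a rough sketch of the two-level structure I have
in mind. It is not libomp code and does not touch the A64FX registers: all names (SpinBarrier,
TwoLevelBarrier, run_phases) are hypothetical, and the per-domain "leaf" barrier is an ordinary
software barrier standing in for the hardware barrier. The threads_pinned flag is likewise an
assumption that models the fallback when threads are not bound to cores.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Conventional software barrier (sense-reversing, spinning on a shared flag).
class SpinBarrier {
public:
  explicit SpinBarrier(int n) : count_(n), total_(n), sense_(false) {}
  void wait() {
    const bool my_sense = !sense_.load(std::memory_order_acquire);
    if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      // Last arriver: reset the counter, then flip the flag to release the rest.
      count_.store(total_, std::memory_order_relaxed);
      sense_.store(my_sense, std::memory_order_release);
    } else {
      while (sense_.load(std::memory_order_acquire) != my_sense)
        std::this_thread::yield();
    }
  }

private:
  std::atomic<int> count_;
  const int total_;
  std::atomic<bool> sense_;
};

// Hypothetical two-level barrier. leaf_[d] is a software stand-in for the
// hardware barrier of L2 domain d; cross_ synchronizes one leader per domain.
class TwoLevelBarrier {
public:
  TwoLevelBarrier(int num_domains, int threads_per_domain, bool threads_pinned)
      : num_domains_(num_domains), per_domain_(threads_per_domain),
        use_leaf_(threads_pinned), cross_(num_domains),
        flat_(num_domains * threads_per_domain) {
    for (int d = 0; d < num_domains; ++d)
      leaf_.push_back(
          std::unique_ptr<SpinBarrier>(new SpinBarrier(threads_per_domain)));
  }

  void wait(int tid) {
    if (!use_leaf_) {  // Threads not bound to cores: software barrier only.
      flat_.wait();
      return;
    }
    SpinBarrier &leaf = *leaf_[tid / per_domain_];
    leaf.wait();                    // Gather within the L2 domain.
    if (num_domains_ == 1) return;  // Optimal case: the leaf barrier suffices.
    if (tid % per_domain_ == 0)
      cross_.wait();                // Domain leaders synchronize across domains.
    leaf.wait();                    // Leader releases the rest of its domain.
  }

private:
  const int num_domains_;
  const int per_domain_;
  const bool use_leaf_;
  SpinBarrier cross_;
  SpinBarrier flat_;
  std::vector<std::unique_ptr<SpinBarrier>> leaf_;
};

// Runs `phases` rounds and reports whether every thread saw all arrivals of a
// round after passing the barrier.
bool run_phases(int num_domains, int threads_per_domain, int phases, bool pinned) {
  const int n = num_domains * threads_per_domain;
  TwoLevelBarrier barrier(num_domains, threads_per_domain, pinned);
  std::atomic<int> arrived(0);
  std::atomic<bool> ok(true);
  std::vector<std::thread> workers;
  for (int tid = 0; tid < n; ++tid) {
    workers.emplace_back([&, tid] {
      for (int p = 0; p < phases; ++p) {
        arrived.fetch_add(1);
        barrier.wait(tid);
        if (arrived.load() != (p + 1) * n) ok.store(false);
        barrier.wait(tid);  // Keep fast threads out of the next round's count.
      }
    });
  }
  for (auto &w : workers) w.join();
  return ok.load();
}
```

For example, run_phases(2, 4, 100, true) exercises two domains of four threads each through the
leaf/cross/leaf path, and passing false as the last argument exercises the software-only fallback.
In a real implementation the leaf step would be replaced by the hardware barrier, and how task
execution fits in is left open here.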

I'd appreciate any comments. 

Regards, 
Tomohiro

