[cfe-dev] [RFC] [OpenCL] Split opencl-c.h header

Fri Aug 31 08:38:42 PDT 2018

I submitted a patch on Phabricator (https://reviews.llvm.org/D51544),
but I'd like to get more input from OpenCL users to avoid breaking
anyone's workflow.

Let me know if you have any comments/suggestions.

TL;DR
---------------
This patch splits the huge opencl-c.h header into multiple headers to
support efficient use of Precompiled Headers (or Modules):

  - opencl-c-defs.h contains all preprocessor macros. Macros are not
    saved in PCHs.

  - opencl-c-common.h contains builtins which do not depend on any
    preprocessor macro (extensions, OpenCL C version).

  - opencl-c-fp16.h and opencl-c-fp64.h contain builtins which require
    either the cl_khr_fp16 or the cl_khr_fp64 macro, but do not depend
    on other extensions or the OpenCL C version.

  - opencl-c-platform.h (looking for a better name! 'target' maybe?)
    contains all other builtins which have more complicated
    requirements.

The umbrella header opencl-c.h includes all of the headers above, so
this change is backward compatible with the original design.

Details
---------------
A typical OpenCL compiler implicitly includes opencl-c.h before
compiling user code:

  #include "opencl-c.h"
  __kernel void k() { printf("hello world\n"); }

With this approach, even for tiny programs the compiler spends most of
its time parsing the opencl-c.h header (which has more than 16000
LOC), so it takes ~1s to compile our hello world example into LLVM IR.

The obvious solution to this problem is to compile opencl-c.h into an
AST only once, and then use this AST to compile user code. This
feature can be implemented using Precompiled Headers or Clang Modules,
but it has one major drawback: the AST must be built with the same set
of preprocessor macros and the same target triple as the user
code. Consider the following example:

opencl-c.h:
  #if __OPENCL_C_VERSION__ >= CL_VERSION_2_0
    float fract(float x, float *iptr);
  #else
    float fract(float x, __global float *iptr);
  #endif

If we compile this opencl-c.h into an AST, only one of these two
functions will be present. We cannot use this AST to compile for both
OpenCL C 1.2 and 2.0.
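The effect can be reproduced in plain C. The sketch below is only an
illustration: FAKE_OPENCL_C_VERSION, iptr_space and fract_iptr_space()
are hypothetical names standing in for __OPENCL_C_VERSION__ and the two
fract() declarations. Whichever branch the preprocessor keeps is the
only one that exists after compilation, which is what happens to the
AST stored in a PCH:

```c
#include <assert.h>

/* Hypothetical stand-in for __OPENCL_C_VERSION__ (assumption: 120 means
 * OpenCL C 1.2, 200 means OpenCL C 2.0). Override with -D to switch. */
#ifndef FAKE_OPENCL_C_VERSION
#define FAKE_OPENCL_C_VERSION 120
#endif

static_assert(FAKE_OPENCL_C_VERSION == 120 || FAKE_OPENCL_C_VERSION == 200,
              "this sketch only models OpenCL C 1.2 and 2.0");

/* Address space of fract()'s pointer argument in each declaration. */
enum iptr_space { IPTR_GLOBAL, IPTR_GENERIC };

#if FAKE_OPENCL_C_VERSION >= 200
/* Models: float fract(float x, float *iptr); */
enum iptr_space fract_iptr_space(void) { return IPTR_GENERIC; }
#else
/* Models: float fract(float x, __global float *iptr); */
enum iptr_space fract_iptr_space(void) { return IPTR_GLOBAL; }
#endif
```

Built with the default value, only the __global variant exists in the
resulting object; supporting the other version requires a second build
with -DFAKE_OPENCL_C_VERSION=200.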

Another example: if we compile opencl-c.h into an AST with the
spir-unknown-unknown (32-bit) target triple, the AST will have
'unsigned int' instead of 'size_t':

opencl-c.h:
  typedef __SIZE_TYPE__ size_t; // __SIZE_TYPE__ is unsigned int for spir
  size_t get_global_size(uint dimindx);

This makes the AST non-portable, and it cannot be used to compile for
the spir64 triple (where size_t is a 64-bit integer).
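The same behaviour can be observed with a host compiler. The sketch
below assumes a compiler that predefines __SIZE_TYPE__ (Clang and GCC
both do); pch_size_t and pch_size_t_bits() are hypothetical names used
only for illustration. The typedef is resolved to a concrete integer
type at compile time, so its width depends on the target the file was
built for:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Same pattern as in opencl-c.h; after parsing, only the concrete
 * integer type chosen for this target remains. */
typedef __SIZE_TYPE__ pch_size_t;

/* The typedef must agree with the target's real size_t. */
static_assert(sizeof(pch_size_t) == sizeof(size_t),
              "__SIZE_TYPE__ should match size_t");

/* Width in bits of the type baked in at compile time
 * (e.g. 32 on a 32-bit target, 64 on a 64-bit one). */
unsigned pch_size_t_bits(void) {
    return (unsigned)(sizeof(pch_size_t) * CHAR_BIT);
}
```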

If we want the compiler to support CL1.2/CL2.0 for the spir/spir64
triples, we'll need 4 different PCHs compiled with different flags:

  -cl-std=CL1.2 + -triple spir
  -cl-std=CL1.2 + -triple spir64
  -cl-std=CL2.0 + -triple spir
  -cl-std=CL2.0 + -triple spir64

Things get worse if we want to support multiple devices that have
different sets of OpenCL extensions. For example, if we want to
support both a CPU (which supports cl_khr_fp64) and a GPU (which
supports cl_khr_fp16), we must compile opencl-c.h with different
options, because it has the following code all over the place:

  #ifdef cl_khr_fp64
    uchar __ovld __cnfn convert_uchar(double);
  #endif
  #ifdef cl_khr_fp16
    uchar __ovld __cnfn convert_uchar(half);
  #endif

So if we want to add these 2 devices, we now need 4*2 different PCHs,
since every combination of -cl-std and -triple must be compiled twice
with different OpenCL extension defines.

Each PCH is about 2.5M, so we need ~20M of memory to store our
PCHs. Every additional device or OpenCL C version multiplies the
number of PCHs, so this total grows quickly.
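The arithmetic above can be sketched as a small C helper. The function
names and the 2560 KB figure (2.5M per PCH, taking 1M = 1024K) are
assumptions made for illustration; the counts come from the lists
above (2 OpenCL C versions, 2 triples, 2 devices):

```c
#include <assert.h>

/* Assumed size of one PCH in KB: 2.5M, taking 1M = 1024K. */
#define PCH_SIZE_KB 2560u

/* One PCH is needed per (-cl-std, -triple, extension set) combination. */
unsigned pch_count(unsigned n_std, unsigned n_triple, unsigned n_ext_sets) {
    assert(n_std > 0 && n_triple > 0 && n_ext_sets > 0);
    return n_std * n_triple * n_ext_sets;
}

/* Total storage in KB for all PCH variants. */
unsigned pch_total_kb(unsigned n_std, unsigned n_triple, unsigned n_ext_sets) {
    return pch_count(n_std, n_triple, n_ext_sets) * PCH_SIZE_KB;
}
```

With 2 versions, 2 triples and 2 devices, pch_count(2, 2, 2) is 8 and
pch_total_kb(2, 2, 2) is 20480 KB, i.e. the ~20M quoted above.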

For a C/C++ compiler this is not a problem: clang maintains a cache so
that only the combinations which are actually used are saved to
disk. OpenCL compilers are different, because they are often
distributed as a shared library. A library is not expected to write
anything to disk, because doing so raises a number of questions, such
as: "where do we have a writable temporary directory?" and "how much
space can we consume for PCHs?". For some targets (say, embedded
devices) it is better to avoid this approach.

To solve this problem, the patch splits the opencl-c.h header into
multiple headers.

First, we split out 'common' functions, which should be present
regardless of OpenCL C version or extensions. All builtins which use
'core' types, such as convert, math and simple image builtins, are
moved into opencl-c-common.h (8K LOC).

Second, we split out all functions which are 'common' across OpenCL C
versions, but require either cl_khr_fp16 or cl_khr_fp64 (not both at
the same time!). The two headers (opencl-c-fp16.h and opencl-c-fp64.h)
have 2K LOC each.

Everything else goes into opencl-c-platform.h (5K LOC).

All macros go into opencl-c-defs.h, because macros cannot be
pre-compiled into an AST.

With this setup, we can compile opencl-c-common.h, opencl-c-fp16.h and
opencl-c-fp64.h into PCHs with one set of extensions/OpenCL C version,
and use them with any other set of extensions/OpenCL C version. Clang
will detect the mismatch and report an error, which can be safely
disabled with the -fno-validate-pch option.

opencl-c-platform.h (5K LOC) must still be pre-compiled for each
supported combination, but since it is a lot smaller (~0.5M vs the
original 2.5M), it is not that bad. Alternatively, we can sacrifice
some performance and leave this header without pre-compilation: a
large portion of opencl-c-platform.h consists of vendor extensions,
which will be removed by the preprocessor anyway.

-- 
Andrew
