[cfe-dev] [RFC][HIPSPV] Emitting HIP device code as SPIR-V

Fri Aug 13 08:05:19 PDT 2021

On 8/12/21 2:04 PM, Liu, Yaxun (Sam) via cfe-dev wrote:
> + Artem for awareness and comments.
>
> Overall I support this proposal.

Same here. Since I want to reuse the generic parts for OpenMP offloading, we
should keep the non-HIP parts generic :)

~ Johannes

>
> My comments are below.
>
> Thanks.
>
> Sam
>
> -----Original Message-----
> From: Henry Linjamäki <henry.linjamaki at parmance.com>
> Sent: Monday, August 9, 2021 2:57 AM
> To: cfe-dev at lists.llvm.org
> Cc: llvm-dev at lists.llvm.org; Pekka Jääskeläinen <pekka.jaaskelainen at parmance.com>; Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> Subject: [RFC][HIPSPV] Emitting HIP device code as SPIR-V
>
> Hi all,
>
> HIP is a C++ runtime API and kernel language that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code [0]. There are also projects for running HIP code on other platforms: HIPLZ [3], which runs HIP programs on Intel GPUs via the Intel Level Zero API [1], and HIPCL [2], which runs HIP programs on OpenCL devices that support certain advanced features. Both of these implementations take device code as SPIR-V binaries.
>
> We are proposing a patch set to be upstreamed that enables SPIR-V emission through the HIP code path. The end goal of the patches is to emit SPIR-V binaries from HIP device code so that they can be embedded into executables for OpenCL-like environments (at least for starters). Our current focus is on the two above-mentioned projects, HIPCL and HIPLZ, which are both work-in-progress HIP implementations. They do not consume the SPIR-V themselves; the device binaries are handed over to the OpenCL and Intel Level Zero APIs, respectively.
>
> Roughly, the current process of translating HIP code to SPIR-V in LLVM/Clang involves:
>
> * Retargeting HIP device code generation to the SPIR-V target.
> * Mapping address spaces in HIP to corresponding ones in SPIR-V.
> * Expanding HIP features that cannot be directly modeled in SPIR-V
>    (e.g. dynamic shared memory).
>
> The HIPSPIRV experimental branch is available at [4]. Note that it is not yet in a state we intend to propose for upstreaming; shaping up the patches is a work in progress. Before proceeding to shape up and submit the patches, we would like to get feedback on our upstreaming plans. In the following sections, we expand on the above points and sketch our planned changes to LLVM (mostly to Clang) to achieve the goal.
>
> Retargeting device codegen
> ==========================
>
> To make the HIP toolchain emit and embed SPIR-V, we are tentatively planning the following changes to LLVM/Clang:
>
> * Introduce, at minimum, a 'spirv64' architecture type in Triple. This
>    is what the SPIR-V backend [5] (SPIR-V BE) effort is planning to
>    upstream. We would like to upstream this change in advance to
>    specify the HIP SPIR-V device code target, potentially before the
>    SPIR-V BE work lands.
>
> * Implement a new SPIRVTargetInfo and fill it with necessary
>    information. For HIPCL/-LZ we are planning to adjust the address
>    space mapping in a way which is discussed later in the 'address
>    space mapping' section.
>
> * Introduce a clang option to override the HIP device code target. We
>    are interested in the option '--offload=<target>' discussed in the
>    'Unified offload option for CUDA/HIP/OpenMP' thread [6], which
>    would suit this use case well. As far as we know, the subject has
>    not advanced beyond the discussion; is anyone working on it?
>
> [Sam] I am working on it, but I got distracted by other work, so it may take some time. I would suggest using --offload-arch=spirv for now. If you have a partial implementation of the --offload option, you may upstream it.
>
> * Compilation driver:
>
>    The HIP offload builder is changed to retrieve the offload device
>    target from the --offload option. If the option is not present, it
>    falls back to AMD's default target, so the current default HIP
>    compilation behavior does not change.
>
>    Temporarily change the Driver to force clang to emit LLVM bitcode
>    for SPIR-V targets in the backend compilation phase. Otherwise,
>    compilation will fail in many parts of the code due to the lack of
>    a real SPIR-V BE. The reworked HIPToolChain takes care of
>    translating the bitcode to SPIR-V during the linking phase. When
>    the SPIR-V BE lands in LLVM, we can revert this change.
>
> * Introduce 'hipspv' as an OS or environment type in Triple. The
>    type's primary (and current) use is to select the device offload
>    toolchain for HIPCL/-LZ.
>
> * Implement a new toolchain class 'HIPSPVToolChain' in clang, which is
>    selected when the HIP device target is specified as
>    'spirv64-unknown-hipspv' with the --offload option. Since the SPIR-V
>    BE might not land in LLVM soon, we will set up the compilation flow
>    to produce the SPIR-V binary using the LLVM-SPIR-V translator [7],
>    as our experimental branch does.
>
>    One important thing the toolchain does is run one or several LLVM
>    IR passes, which are needed by the HIPCL/-LZ runtime, on the final,
>    fully linked device bitcode. The passes must run at link time,
>    because all user-specified device code and HIPCL/-LZ device
>    library routines have to be visible to them. The reason for this
>    requirement is explained in the 'HIP code expansion' section.
>    HIPSPVToolChain will use the opt tool to run the passes at link
>    time.
>
> * Currently, HIPToolChain is derived from ROCMToolChain and its long
>    chain of superclasses (AMDGPUToolChain, Generic_ELF and
>    Generic_GCC). The new upstreamed target would not logically belong
>    to the AMDGPU/ROCm family, so it does not make sense to derive the
>    HIPCL toolchain from the HIP toolchain. Therefore, we propose to:
>
>    - Create a new base HIP tool chain, 'BaseHIPToolChain' or just
>      'HIPToolChain', derived directly from ToolChain and put any
>      HIP-related code that is common or that can be reused in the
>      derived toolchains there.
>
>    - Derive a new HIPSPVToolChain from HIPToolChain.
>
>    - Rebase the current HIPToolChain under the new base HIPToolChain
>      and rename it to HIPAMDToolChain. Since the current HIPToolChain
>      depends on methods in its superclasses (e.g. AMDGPUToolChain's
>      getParsedTargetID), the rebased class is planned to be a proxy
>      class, to avoid code duplication and reduce the amount of
>      changes. Another option for refactoring the current HIPToolChain
>      would be multiple inheritance, but that leads to the dreaded
>      diamond class structure, which is probably not a great choice.
>
>    With the current plan, HIPToolChain is not going to have much code
>    to share with the derived classes: so far only a bit of the
>    "fat binary" construction code is in sight for sharing, so the
>    immediate gains from the effort seem small. However, the TC layout
>    is more logical, and it may encourage more HIP implementations as
>    well as help refactoring going forward.
>
> [Sam] HIPToolChain currently inherits from ROCMToolChain (https://clang.llvm.org/doxygen/classclang_1_1driver_1_1toolchains_1_1ROCMToolChain.html ). ROCMToolChain provides information about device libraries that are shared by OpenCL, HIP and OpenMP. We could rename the current HIPToolChain to AMDGPUHIPToolChain. I can see SPIRVHIPToolChain or HIPSPVToolChain as a toolchain that provides the include files and device library for HIP on SPIR-V. However, I don't see the need for a generic HIP toolchain, since there is little shared information between AMDGPUHIPToolChain and HIPSPVToolChain. I would suggest introducing a helper class, e.g. HIPToolChainUtility, to hold the utilities shared by the different HIP toolchains.
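The class layout proposed above, together with triple-based toolchain selection, can be sketched with stand-in types. This is a minimal model, not clang's driver code: the real clang::driver classes have different interfaces, and the member `Impl`, the methods `name()` and `parsedTargetID()`, and the helper functions are hypothetical. The proposed `--offload` value is modeled as a `std::optional`, and `amdgcn-amd-amdhsa` is used as the fallback because it is the triple clang uses today for AMD HIP device code.

```cpp
#include <memory>
#include <optional>
#include <string>

// Stand-in types that model the proposed class layout; these are NOT the
// real clang::driver classes.
struct ToolChain {
  virtual ~ToolChain() = default;
  virtual std::string name() const = 0;
};

// Models the AMD-specific functionality the current HIPToolChain inherits
// today (e.g. target-ID parsing). Trivial body for illustration only.
struct ROCMToolChain : ToolChain {
  std::string name() const override { return "ROCm"; }
  std::string parsedTargetID(const std::string &Arch) const { return Arch; }
};

// Proposed common HIP base, derived directly from ToolChain.
struct HIPToolChain : ToolChain {};

// SPIR-V toolchain for HIPCL/HIPLZ.
struct HIPSPVToolChain : HIPToolChain {
  std::string name() const override { return "HIPSPV"; }
};

// AMD toolchain as a proxy: it owns a ROCMToolChain and forwards to it
// instead of also inheriting from it, which avoids a diamond hierarchy.
struct HIPAMDToolChain : HIPToolChain {
  ROCMToolChain Impl;
  std::string name() const override { return "HIPAMD"; }
  std::string parsedTargetID(const std::string &Arch) const {
    return Impl.parsedTargetID(Arch);
  }
};

// Device triple from the proposed --offload value, falling back to AMD's
// default so that existing HIP compilations are unaffected.
std::string deviceTriple(const std::optional<std::string> &Offload) {
  return Offload.value_or("amdgcn-amd-amdhsa");
}

// Pick the HIP toolchain from the OS/environment component of the triple.
std::unique_ptr<HIPToolChain> selectHIPToolChain(const std::string &Triple) {
  const std::string OS = Triple.substr(Triple.rfind('-') + 1);
  if (OS == "hipspv")
    return std::make_unique<HIPSPVToolChain>();
  return std::make_unique<HIPAMDToolChain>();
}
```

Under Sam's alternative, the empty HIPToolChain base would go away: shared code such as fat-binary construction would live in a helper (e.g. HIPToolChainUtility) used by both toolchains, and the AMD toolchain could keep inheriting from ROCMToolChain directly instead of forwarding to it.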
>
> Address space mapping
> =====================
>
> Translating HIP device code to a valid SPIR-V binary requires tweaks to
> pointers:
>
> Pointers without address space (AS) qualification in HIP programs are considered "flat" pointers: they can dynamically point to function-local, __device__, __shared__ and __constant__ memory, which matches the idea of 'generic' pointers introduced in OpenCL 2.0. Therefore, the logical choice is to map flat pointers to generic pointers in SPIR-V's OpenCL environment. HIPCL's and HIPLZ's SPIR-V environment mandates that kernel pointer parameters point to __global, __local or __constant memory (these are named differently in SPIR-V; we use the OpenCL names as they are more familiar). So pointer parameters of HIP kernel (__global__) functions are mapped to global pointers. Other HIP pointers with AS qualifiers are mapped to their SPIR-V equivalents, where suitable.
>
> Now, there are significant differences between HIP's __constant__ and SPIR-V/OpenCL's constant address space:
>
> * In HIP, __constant__ globals can be altered on the host side with
>    the hipMemcpyToSymbol() API function. The OpenCL host API offers no
>    way to do this.
>
>    (Side-note: the OpenCL host API has no equivalent of
>    hipMemcpyToSymbol. HIPCL currently supports hipMemcpyToSymbol for
>    program-scope __global variables via Intel's
>    clGetDeviceGlobalVariablePointerINTEL API extension, but for
>    portability we are planning to inject shadow kernel commands that
>    access the global variables instead.)
>
> * In HIP, flat pointers can point to __constant__ memory. In OpenCL
>    this is not the case for __generic pointers: __constant pointers
>    cannot be cast to __generic pointers and vice versa.
>
> There are a couple of ways to deal with constants:
>
> * Map __constant__ to __global space in SPIR-V. That way we can
>    generate code that works and is simple to implement. Of course, we
>    lose the optimization/placing benefits of constant memory.
>
> * Transform the code after clang codegen (by an LLVM pass) by
>    converting the __constant objects to kernel arguments. This covers
>    the hipMemcpyToSymbol() case. There is still the constant-to-generic
>    cast issue, so we would have to use the first option as a
>    fallback.
>
> We plan to start by upstreaming the first option and, time permitting, improve on it by implementing the second.
>
> [Sam] I doubt the second approach would work, since users may change the constant variable at arbitrary places in host code, and the compiler cannot know the correct value to pass through a kernel argument. I would suggest extending SPIR-V to represent the externally_initialized attribute and translating __constant__ variables to the constant address space with that attribute. However, translating them to the global address space is OK as a first step.
>
> The planned changes to Clang to achieve the aforementioned AS mapping are as follows:
>
> * Define an address space map in the new SPIRVTargetInfo that maps the
>    CUDA address spaces (which HIP reuses) as described above. The
>    default AS (0), used for flat pointers, is mapped to SPIR-V's
>    'generic'. We intend this mapping to be enabled when the language
>    mode is HIP.
>
> * Change SPIRABIInfo to coerce kernel AS-unqualified pointer arguments
>    to __global ones. Pointer arguments in regular device functions
>    receive the __generic AS qualifier via the address space mapping
>    defined in SPIRVTargetInfo in the above point.
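A minimal sketch of the mapping described above, written as a plain function rather than as clang's actual SPIRVTargetInfo address space map (which is a table indexed by LangAS). The enum and function names are hypothetical; the numbers follow the SPIR convention used by the LLVM-SPIR-V translator (private 0, global 1, constant 2, local 3, generic 4), and __constant__ follows the first option above (mapped to global).

```cpp
#include <stdexcept>

// HIP (CUDA) language-level address spaces, as seen by codegen.
enum class HipAS { Default, Device, Shared, Constant };

// SPIR address space numbers (OpenCL convention used by the translator).
namespace spir {
constexpr unsigned Private = 0, Global = 1, Constant = 2, Local = 3,
                   Generic = 4;
}

// Target address space for a pointee in HipAS. IsKernelParam models the
// SPIRABIInfo change: unqualified pointer parameters of __global__
// functions become global pointers rather than generic ones.
unsigned mapToSpir(HipAS AS, bool IsKernelParam = false) {
  switch (AS) {
  case HipAS::Default:
    return IsKernelParam ? spir::Global : spir::Generic; // flat pointers
  case HipAS::Device:
    return spir::Global;
  case HipAS::Shared:
    return spir::Local;
  case HipAS::Constant:
    // First option above: keeps hipMemcpyToSymbol and casts to generic
    // working, at the cost of constant-memory benefits.
    return spir::Global;
  }
  throw std::logic_error("unknown HIP address space");
}

// OpenCL 2.0 generic pointers cover private, global and local memory, but
// not constant memory.
bool castableToGeneric(unsigned SpirAS) {
  return SpirAS == spir::Private || SpirAS == spir::Global ||
         SpirAS == spir::Local;
}
```

The `castableToGeneric` check shows why mapping __constant__ to the SPIR constant space would break flat-pointer casts, while mapping it to global keeps them valid.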
>
>
> HIP code expansion
> ==================
>
> There are features in the HIP language that have no direct counterparts in SPIR-V's OpenCL environment, and those features need to be rewritten before translation to SPIR-V (or, in the future, before lowering to SPIR-V machine code through the new BE). A non-exhaustive list of features that need to be expanded includes:
>
> * Dynamic shared memory allocation (DSM): an array which is declared
>    globally in LLVM IR and whose actual size is determined at kernel
>    launch. OpTypeRuntimeArray in SPIR-V is the closest thing to model
>    this object, but it requires the Shader capability.
>
> * abort() builtin: no counterpart in SPIR-V/OpenCL.
>    (Note: the behavior is not well specified in the HIP spec
>    either. We assume it terminates the whole grid if any work item
>    reaches it. AMD's abort definition calls __builtin_trap.)
>
> * printf(): OpenCL's printf takes the format string as '__constant
>    char*', while in HIP the format string does not have to reside in
>    constant memory.
>
> * Texture objects: these roughly correspond to OpenCL's image and
>    sampler objects combined. Texture objects also carry more
>    information for the texture functions than image+sampler objects do.
>
> * Texture references. Same as above but these are program global
>    objects. In OpenCL, image objects cannot reside in the program
>    global space.
>
> HIPCL/-LZ's solution to the DSM allocation case is that the runtime allocates a shared buffer and passes it to the kernel as an additional argument (hidden from the user). The device code is modified so that the DSM object is replaced with the new kernel argument. The other listed cases will be handled similarly:
>
> * For the printf case, we tentatively replace the printf calls with a
>    function that packs their arguments into a buffer passed as an
>    additional kernel argument, and do the printing on the host side.
>
> * Texture objects will tentatively be split into image and sampler
>    objects and possibly an auxiliary struct carrying the texture
>    settings. At minimum, this means that kernel parameter lists need
>    to be rewritten for texture objects.
>
> * For texture references, we tentatively plan to replace the global
>    texture objects with a number of additional kernel arguments as
>    well.
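The printf expansion hinges on a hidden buffer argument, like the DSM rewrite. It can be sketched as follows, assuming a hypothetical record layout of [format id][argument count][one 64-bit word per argument] and a host-side table of format strings indexed by id; neither is HIPCL/-LZ's actual buffer format, and only %d, %f and %% are decoded. Keeping the format strings on the host is one way to sidestep the __constant format-string requirement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hidden printf buffer. Each record is laid out (hypothetically) as
// [format id][argument count][one 64-bit word per argument].
struct PrintfBuffer {
  std::vector<std::uint64_t> Words;

  // What a rewritten printf call would do on the device: store the
  // format-string id and raw argument bits instead of formatting text.
  void begin(std::uint64_t FmtId, std::uint64_t NArgs) {
    Words.push_back(FmtId);
    Words.push_back(NArgs);
  }
  void pushInt(std::int64_t V) {
    Words.push_back(static_cast<std::uint64_t>(V));
  }
  void pushDouble(double V) {
    std::uint64_t Bits;
    std::memcpy(&Bits, &V, sizeof Bits);
    Words.push_back(Bits);
  }
};

// Host side: decode the record at Pos, expanding %d, %f and %%, and advance
// Pos to the next record. Unsupported conversions, or ones with no argument
// left, are copied verbatim.
std::string decodeRecord(const std::vector<std::string> &Formats,
                         const std::vector<std::uint64_t> &W,
                         std::size_t &Pos) {
  const std::string &Fmt = Formats.at(W.at(Pos++));
  const std::uint64_t NArgs = W.at(Pos++);
  const std::size_t ArgBase = Pos;
  std::uint64_t Used = 0;
  std::string Out;
  for (std::size_t I = 0; I < Fmt.size(); ++I) {
    if (Fmt[I] != '%' || I + 1 == Fmt.size()) {
      Out += Fmt[I];
      continue;
    }
    const char Conv = Fmt[++I];
    if (Conv == '%') {
      Out += '%';
    } else if ((Conv != 'd' && Conv != 'f') || Used == NArgs) {
      Out += '%';
      Out += Conv;
    } else {
      const std::uint64_t Bits = W.at(ArgBase + Used++);
      if (Conv == 'd') {
        Out += std::to_string(static_cast<std::int64_t>(Bits));
      } else {
        double D;
        std::memcpy(&D, &Bits, sizeof D);
        Out += std::to_string(D);
      }
    }
  }
  Pos = ArgBase + NArgs;
  return Out;
}
```

Storing raw 64-bit words keeps the device side trivial (no formatting code runs on the device); all parsing of the format string happens on the host after the kernel finishes.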
>
> For this and other HIP features, we need to apply LLVM IR passes to modify the device code. In many cases, the passes should be run once the device code (as LLVM bitcode) is fully linked. This is straightforward because the HIP offload mechanism already emits device code as LLVM bitcode in RDC mode (-fgpu-rdc), so at link time we receive the device code as LLVM bitcode, to which these expansions can be applied with a full view of the device code.
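For DSM, the effect of such a link-time rewrite can be modeled serially in plain C++: the kernel's global `extern __shared__` array becomes an extra trailing pointer parameter, and the launch code allocates the dynamic size requested at launch time. The function names are hypothetical, and the loop stands in for the work items of a single work-group (barriers are not modeled). In OpenCL terms, the runtime could size such a __local argument with `clSetKernelArg(kernel, index, sharedBytes, NULL)`.

```cpp
#include <cstddef>
#include <vector>

// Before the rewrite (HIP source, for reference):
//   extern __shared__ float dsm[];
//   __global__ void scale(float *out, float k) {
//     dsm[threadIdx.x] = out[threadIdx.x] * k;
//     out[threadIdx.x] = dsm[threadIdx.x] + 1.0f;
//   }
// After the rewrite, the kernel receives the DSM buffer as a hidden
// trailing parameter (a __local pointer in OpenCL terms).
void scaleRewritten(std::size_t Tid, float *Out, float K, float *Dsm) {
  Dsm[Tid] = Out[Tid] * K;
  Out[Tid] = Dsm[Tid] + 1.0f;
}

// Runtime side: allocate the dynamic shared size given at launch and
// append it to the user's arguments. One loop iteration per work item of
// a single work-group; barriers are not modeled.
void launchScale(std::vector<float> &Out, float K, std::size_t SharedBytes) {
  std::vector<float> Dsm(SharedBytes / sizeof(float));
  for (std::size_t Tid = 0; Tid < Out.size() && Tid < Dsm.size(); ++Tid)
    scaleRewritten(Tid, Out.data(), K, Dsm.data());
}
```

Because the rewrite changes the kernel signature, every reference to the DSM symbol, including those inside called device functions, must be visible to the pass, which is why it runs on the fully linked module.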
>
> The current plan for implementing this is to have HIPSPVToolChain construct a linker that uses llvm-link to link the device code, opt to run the needed IR passes, and the external llvm-spirv tool (llc in the future, when the SPIR-V BE lands) to emit the SPIR-V binary. We load the passes from a path the user provides via --hip-link-pass-path (name pending) or, automatically, from the HIP runtime's installation location using the search logic provided by ROCmInstallationDetector.
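The three link steps can be sketched as command lines built by the toolchain. This is a hedged illustration, not HIPSPVToolChain's actual code: the function name, intermediate file names, plugin path and pass pipeline name are hypothetical, and it assumes the runtime passes are packaged as a new-pass-manager plugin loadable with opt's `-load-pass-plugin`.

```cpp
#include <string>
#include <vector>

// One external tool invocation; element 0 is the tool name.
using Command = std::vector<std::string>;

// Builds the link-time steps for the given device bitcode inputs.
// PluginPath would come from the user-provided option or the HIP runtime
// installation; PassPipeline names passes registered by that plugin.
std::vector<Command> buildDeviceLinkJobs(const std::vector<std::string> &Inputs,
                                         const std::string &PluginPath,
                                         const std::string &PassPipeline,
                                         const std::string &Output) {
  const std::string Linked = Output + ".linked.bc";
  const std::string Lowered = Output + ".lowered.bc";

  // 1. Link user device code and device libraries into a single module.
  Command Link = {"llvm-link"};
  Link.insert(Link.end(), Inputs.begin(), Inputs.end());
  Link.push_back("-o");
  Link.push_back(Linked);

  // 2. Run the HIP expansion passes with a full view of the device code.
  Command Opt = {"opt", "-load-pass-plugin=" + PluginPath,
                 "-passes=" + PassPipeline, Linked, "-o", Lowered};

  // 3. Emit SPIR-V with the external translator (llc once the SPIR-V BE
  //    lands).
  Command Spirv = {"llvm-spirv", Lowered, "-o", Output};

  return {Link, Opt, Spirv};
}
```

Keeping the steps as separate tool invocations means that swapping llvm-spirv for llc later only changes the last command.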
>
> There is interest in upstreaming the HIPCL/-LZ passes from the HIPCL/-LZ repositories in the future to reduce the maintenance burden. However, we are not attempting to upstream them initially, as they are not yet complete and are subject to rapid change. The question is: where in the LLVM project tree should the passes eventually go? Would it be OK to add a new directory under Clang for toolchain passes?
>
> [Sam] Transformations that can be performed per TU may be put into clang/lib/CodeGen/CGSPIRVRuntime.cpp, like CGCUDANV.cpp, and called during finalization of the LLVM module. For transformations that need to be performed after linking, we may consider llvm/lib/Frontend/SPIRV, like llvm/lib/Frontend/OpenMP.
>
> Testing
> =======
>
> We will provide llvm-lit tests for our toolchain upstream. We also want to add tests to make sure that clang runs the HIPCL/-LZ runtime passes at device code link time. For this, we need a dummy pass plugin that clang loads during the test.
>
> When the new LLVM SPIR-V BE work lands on LLVM, we will add SPIR-V assembly checks that are relevant for HIPSPV.
>
>
> References
> ==========
>
> [0]: https://rocmdocs.amd.com/en/latest/Programming_Guides/Programming-Guides.html
> [1]: https://spec.oneapi.com/level-zero/latest/index.html
> [2]: https://github.com/cpc/hipcl
> [3]: https://github.com/jz10/anl-gt-gpu
> [4]: https://github.com/parmance/llvm-project/tree/hip2spirv-v5
> [5]: https://github.com/KhronosGroup/LLVM-SPIRV-Backend
> [6]: https://lists.llvm.org/pipermail/cfe-dev/2020-December/067362.html
> [7]: https://github.com/KhronosGroup/SPIRV-LLVM-Translator
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev