[llvm-dev] LLVM GPU News Issue #16, July 23 2021

Fri Jul 23 15:37:12 PDT 2021

Hi folks,

The 16th issue of LLVM GPU News, a bi-weekly newsletter on all
the GPU things under the LLVM umbrella, is out:
<https://llvm-gpu-news.github.io/2021/07/23/issue-16.html>.

I also pasted the content below, in case you prefer to read in your email
client.

-Jakub

======================================================================

# LLVM GPU News Issue #16, July 23 2021
Authors: Jakub Kuderski, Johannes Doerfert, Joseph Huber

Welcome to LLVM GPU News, a bi-weekly newsletter on all the GPU things
under the LLVM umbrella.
This issue covers the period from July 2 to July 22 2021.

We welcome your feedback and suggestions. Let us know if we missed anything
interesting, or want us to bring attention to your (sub)project, revisions
under review, or proposals. Please see the bottom of the page for details
on how to submit suggestions and contribute.

##  LLVM and Clang

### Discussions

*  [Frank Winter asked](
https://lists.llvm.org/pipermail/llvm-dev/2021-July/151826.html) about
adding support for the [NVSHMEM](https://developer.nvidia.com/nvshmem) API
for GPU clusters to the NVPTX backend. There are no replies as of writing.

### Commits

*  Backends can now split register allocation into multiple `RegAlloc`
runs. This is implemented by giving each `RegAlloc` pass a callback which
decides if a register class should be handled or not. AMDGPU uses this to
allocate SGPRs and VGPRs separately. Interestingly, this had been in review
since December 2018. [D55301](https://reviews.llvm.org/D55301)
*  OpenCL gained support for new feature macros:
   *  `__opencl_c_generic_address_space` [D103401](
https://reviews.llvm.org/D103401)
   *  `__opencl_c_read_write_images` [D104915](
https://reviews.llvm.org/D104915)
   *  `__opencl_c_program_scope_global_variables` [D103191](
https://reviews.llvm.org/D103191)
*  The NVPTX matrix operation intrinsics were extended with the `.and.popc`
variant of the `b1` MMA instruction. [D105384](
https://reviews.llvm.org/D105384)
*  Waterfall loops are now marked as `SI_WATERFALL_LOOP` to aid AMDGPU
register allocation optimizations. [D105467](
https://reviews.llvm.org/D105467), [D105192](
https://reviews.llvm.org/D105192)
*  AMDGPU frontends can now generate export-free fragment shaders.
[D105683](https://reviews.llvm.org/D105683)
*  Some AMDGPU instructions are now marked as rematerializable:
   *  `SOP` [D105670](https://reviews.llvm.org/D105670)
   *  `VOP1` [D105742](https://reviews.llvm.org/D105742)
   *  `VOP2` [D106023](https://reviews.llvm.org/D106023)
   *  `VOP3` [D106110](https://reviews.llvm.org/D106110)
*  New `llvm-mca` [`CustomBehaviour` implementation for AMDGPU](
https://reviews.llvm.org/D104730) to handle the `s_waitcnt` instructions
landed but was later [reverted](
https://reviews.llvm.org/rGd38b9f1f31b1fa8ee885cfcd4ee7bd69771088c8).

## MLIR

### Discussions

*  chudur-budur asked about [making linalg.matmul to GPU runnable code](
https://llvm.discourse.group/t/making-linalg-matmul-to-gpu-runnable-code/3910)
that can be executed with the CUDA runtime. There are no replies as of
writing.

### Commits

## OpenMP (Target Offloading)

### Discussions

### Commits

*  The first GPU runtime specific call folding has been pushed [D105787](
https://reviews.llvm.org/D105787), though there are issues still to be
resolved right now. Folding of more runtime calls is already prepared
(e.g., [D106154](https://reviews.llvm.org/D106154), [D106033](
https://reviews.llvm.org/D106033)).
*  OpenMP-Opt will now emit more concise remarks with ID tags [D105939](
https://reviews.llvm.org/D105939). The OpenMP documentation does describe
these remarks in detail, with examples and mitigation strategies where
applicable [D106018](https://reviews.llvm.org/D106018).
*  SPMDization detection was extended to include
`AAHeapToStack`/`AAHeapToShared` analysis results [D105634](
https://reviews.llvm.org/D105634). This optimization, first introduced in
[D102307](https://reviews.llvm.org/D102307), transforms Generic-mode OpenMP
kernels that contain separate worker / main threads into an SPMD kernel
where all threads are worker threads and active throughout the entire
region. This is closer to how CUDA operates and is generally faster when we
can perform this optimization. The new patch allows it to handle
globalization. We're still missing a third optimization which guards
operations meant to be done by a single thread with a barrier before there
is full support. Even though full support is still in progress, we have
already seen good improvements from this transformation.

## External Compilers

### LLPC

### Mesa

*  David Airlie is working on bringing back the Mesa OpenCL 3.0 support and
[ran into issues with the SPIR target always enabling all CL optional
extensions](https://lists.llvm.org/pipermail/cfe-dev/2021-July/068522.html).
Anastasia Stulova noted that [SPIR was originally added as a device
agnostic-target](
https://lists.llvm.org/pipermail/cfe-dev/2021-July/068526.html). However,
for Mesa/clover, it is [desirable to provide Clang a concrete device](
https://lists.llvm.org/pipermail/cfe-dev/2021-July/068527.html) and the
list of supported extensions.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210723/858277a3/attachment.html>