<div dir="ltr"><div>Hi,</div><div><br></div><div>Heterogeneous modules seem like an important feature when targeting accelerators.</div><div><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 27, 2020 at 11:01 PM Johannes Doerfert via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">TL;DR<br>

-----<br>

<br>

Let's allow merging two LLVM-IR modules for different targets (with<br>

compatible data layouts) into a single LLVM-IR module to facilitate<br>

host-device code optimizations.<br></blockquote><div><br></div><div>The main question I have is about this limitation on the datalayout: isn't it too limiting in practice?</div><div>I understand that this is much easier to implement in LLVM today, but it may box us in with respect to what can be supported in the future.</div><div>Have you looked into what it would take to support heterogeneous modules in which each target keeps its own datalayout (DL)?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

<br>

Wait, what?<br>

-----------<br>

<br>

Given an offloading programming model of your choice (CUDA, HIP, SYCL,<br>

OpenMP, OpenACC, ...), the current pipeline will most likely optimize<br>

the host and the device code in isolation. This is problematic as it<br>

makes everything from simple constant propagation to kernel<br>

splitting/fusion painfully hard. The proposal is to merge host and<br>

device code in a single module during the optimization steps. This<br>

should not incur any cost (if people don't use the functionality).<br>

<br>

<br>

But how do heterogeneous modules help?<br>

--------------------------------------<br>

<br>

Assuming we have heterogeneous LLVM-IR modules we can look at<br>

accelerator code optimization as an interprocedural optimization<br>

problem. You basically call the "kernel" but you cannot inline it. So<br>

you know the call site(s) and arguments, can propagate information back<br>

and forth (=constants, attributes, ...), and modify the call site as<br>

well as the kernel simultaneously, e.g., to split the kernel or fuse<br>

consecutive kernels. Without heterogeneous LLVM-IR modules we can still do<br>

all of this, but it requires a lot more machinery. Given abstract call sites<br>

[0,1] and enabled interprocedural optimizations [2], host-device<br>

optimizations inside a heterogeneous module are really not (much)<br>

different than any other interprocedural optimization.<br>

<br>

[0] <a href="https://llvm.org/docs/LangRef.html#callback-metadata" rel="noreferrer" target="_blank">https://llvm.org/docs/LangRef.html#callback-metadata</a><br>

[1] <a href="https://youtu.be/zfiHaPaoQPc" rel="noreferrer" target="_blank">https://youtu.be/zfiHaPaoQPc</a><br>

[2] <a href="https://youtu.be/CzWkc_JcfS0" rel="noreferrer" target="_blank">https://youtu.be/CzWkc_JcfS0</a><br>

<br>

<br>

Where are the details?<br>

----------------------<br>

<br>

This is merely a proposal to get feedback. I talked to people before and<br>

got mixed results. I think this can be done in an "opt-in" way that is<br>

non-disruptive and without penalty. I sketched some ideas in [3] but<br>

*THIS IS NOT A PROPER PATCH*. If there is interest, I will provide more<br>

thoughts on design choices and potential problems. Since there is not<br>

much yet, I was hoping this would be a community effort from the very<br>

beginning :)<br>

<br>

[3] <a href="https://reviews.llvm.org/D84728" rel="noreferrer" target="_blank">https://reviews.llvm.org/D84728</a><br>

<br>

<br>

But MLIR, ...<br>

-------------<br>

<br>

I imagine MLIR can be used for this and there are probably good reasons<br>

to do so. We might not want to do it *only* there, for mainly the<br>

same reasons other things are still developed at the LLVM-IR level. Feel<br>

free to ask though :)</blockquote><div> </div><div>(+1 : MLIR is not intended to be a reason to not improve LLVM!)<br></div><div><br></div><div>-- </div><div>Mehdi</div><div><br></div></div></div>