[Openmp-dev] What changes to IR does late outlining require?

Mon Oct 28 13:00:17 PDT 2019

Just a short answer because I will soon(ish) write a longer RFC to the LLVM list on a related topic.
If people want, I'll share a draft of that RFC before it is sent.

1) Outlining is beneficial as it avoids miscompilations in a pipeline that is not aware of parallelism.
2) While alternatives to early outlining exist, e.g., all late outlining approaches, it is:
  a) not clear they are sound, or how they can be made sound given that they are not by default
  b) not clear how much work it is to make them sound & optimize the code (opposing goals!)
  c) not clear that IPO + abstract call sites [0,1,2] will not provide most of what we want anyway (so far they do, IMHO)
3) To get around the host-device separation, i.e., the "outlining" into a different module, we will:
  a) need to come up with something new; none of the proposed approaches ever touched that
  b) need to allow "heterogeneous modules"; I have a prototype I can make available if people are interested
4) We can already move constants between sequential and parallel OpenMP parts (for firstprivate + omp parallel)
    and moving any code is on my TODO list (see the LLVM Dev'19 IPO panel discussion!)
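
To make points 2c and 4 concrete, here is a minimal sketch in C of what early outlining does to a parallel region with a firstprivate constant. Everything named here (fork_call_stub, region_ctx, region_outlined, run_region) is hypothetical and only stands in for the real runtime interface; the stub runs the "threads" sequentially so the example works without an OpenMP runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an OpenMP runtime fork call (NOT the real
   libomp interface): it invokes the outlined body once per "thread". */
typedef void (*outlined_fn)(int thread_id, void *shared);

static void fork_call_stub(outlined_fn body, int num_threads, void *shared) {
    for (int t = 0; t < num_threads; ++t)
        body(t, shared); /* sequential here; a real runtime runs these concurrently */
}

/* Values captured for the region: `scale` is firstprivate (copied in),
   `out` points to shared storage. */
struct region_ctx {
    int  scale;
    int *out;
};

/* Early-outlined body of the source-level region
     #pragma omp parallel firstprivate(scale)
     out[omp_get_thread_num()] = scale * 2;                      */
static void region_outlined(int thread_id, void *shared) {
    struct region_ctx *ctx = shared;
    assert(ctx != NULL);
    ctx->out[thread_id] = ctx->scale * 2;
}

/* Sequential part: `scale` is a known constant here, but the outlined
   body only receives it indirectly, through the runtime call. */
void run_region(int *out, int num_threads) {
    struct region_ctx ctx = { .scale = 21, .out = out };
    fork_call_stub(region_outlined, num_threads, &ctx);
}
```

In this sketch, an inter-procedural pass that can look through the runtime call (the role I see for abstract call sites) could learn that ctx->scale is always 21 inside region_outlined and fold the store to a constant, without ever undoing the outlining.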

Cheers,
  Johannes

[0] https://www.youtube.com/watch?v=zfiHaPaoQPc
[1] https://www.youtube.com/watch?v=3AbS82C3X30
[2] https://llvm.org/devmtg/2019-10/talk-abstracts.html#tech24

________________________________________
From: Openmp-dev <openmp-dev-bounces at lists.llvm.org> on behalf of Jon Chesterfield via Openmp-dev <openmp-dev at lists.llvm.org>
Sent: Wednesday, October 23, 2019 09:19
To: via Openmp-dev
Subject: [Openmp-dev] What changes to IR does late outlining require?

I'm trying to determine why offload outlining is done in clang. As far back as 2012 or so, the docs suggest that doing it in IR would be better but doesn't work. One of them cites a thesis by Novillo that talks about making optimizations better, but it doesn't obviously say why IR doesn't work for this or what modifications would be required.

There are challenges compiling GPU code under the SSA/SIMT model, but they're common to OpenCL/HIP. I think we basically opt out of a lot of optimizations at present.

Moving memory ops across a target region won't work but that's manageable. Live in/out are roughly what we copy into the object for early outlining.
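
As a rough illustration of the live-in/live-out point, the sketch below models a target region in plain C. The names (target_capture, target_region_outlined, offload_sum) and the fixed capacity of 8 are assumptions made for the example, and the "launch" is an ordinary function call rather than a real offload.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical capture object for one target region: live-ins are copied
   in before the launch, live-outs are read back afterwards. */
struct target_capture {
    int n;        /* live-in  */
    int data[8];  /* live-in  */
    int sum;      /* live-out */
};

/* Outlined region body; in a real offload build this would be compiled
   for the device and would only see the capture object. */
static void target_region_outlined(struct target_capture *cap) {
    assert(cap != NULL);
    cap->sum = 0;
    for (int i = 0; i < cap->n; ++i)
        cap->sum += cap->data[i];
}

/* Host side: fill the capture object, "launch", read the result back. */
int offload_sum(const int *host_data, int n) {
    struct target_capture cap;
    assert(n >= 0 && n <= 8);
    cap.n = n;
    memcpy(cap.data, host_data, (size_t)n * sizeof(int)); /* copy live-ins */
    target_region_outlined(&cap); /* stands in for the kernel launch */
    return cap.sum;               /* live-out copied back */
}
```

The region only sees the snapshot taken by the copy, so a host store to host_data that is moved from before the copy to after it would silently be lost; that is the sense in which memory operations cannot be moved across the region.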

I'll ask this at the round table later, but any information the community can offer before that will be helpful.

Thanks!

Jon