<div dir="ltr">In fact, I have two modules:<div>a) the host module</div><div>b) the accelerator module</div><div><br></div><div>Each one is compiled independently. The runtime takes care of the offloading operations and loads the accelerator code. Imagine that you want to compile for amd64 and NVIDIA PTX. You cannot do that in a single module, and even if it were supported, it would get scary. How would you handle, in a clean way, the architecture differences that affect the IR, e.g. pointer size, stack alignment and much more?</div><div><br></div><div>--chris</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jun 6, 2015 at 12:30 PM, C Bergström <span dir="ltr"><<a href="mailto:cbergstrom@pathscale.com" target="_blank">cbergstrom@pathscale.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Sun, Jun 7, 2015 at 2:22 AM, Eric Christopher <<a href="mailto:echristo@gmail.com">echristo@gmail.com</a>> wrote:<br>

><br>

><br>

> On Sat, Jun 6, 2015 at 5:02 AM C Bergström <<a href="mailto:cbergstrom@pathscale.com">cbergstrom@pathscale.com</a>> wrote:<br>

>><br>

>> On Sat, Jun 6, 2015 at 6:24 PM, Christos Margiolas<br>

>> <<a href="mailto:chrmargiolas@gmail.com">chrmargiolas@gmail.com</a>> wrote:<br>

>> > Hello,<br>

>> ><br>

>> > Thank you a lot for the feedback. I believe that the heterogeneous<br>

>> > engine<br>

>> > should be strongly connected with parallelization and vectorization<br>

>> > efforts.<br>

>> > Most of the accelerators are parallel architectures where having<br>

>> > efficient<br>

>> > parallelization and vectorization can be critical for performance.<br>

>> ><br>

>> > I am interested in these efforts and I hope that my code can help you<br>

>> > manage the offloading operations. Your LLVM instruction set extensions<br>

>> > may<br>

>> > require some changes in the analysis code but I think it is going to be<br>

>> > straightforward.<br>

>> ><br>

>> > I am planning to push my code to Phabricator in the next few days.<br>

>><br>

>> If you're doing the extracting at the loop and llvm ir level - why<br>

>> would you need to modify the IR? Wouldn't the target level lowering<br>

>> happen later?<br>

>><br>

>> How are you actually determining to offload? Is this tied to<br>

>> directives or using heuristics+some set of restrictions?<br>

>><br>

>> Lastly, are you handling 2 targets in the same module or end up<br>

>> emitting 2 modules and dealing with recombining things later?<br>

>><br>

><br>

> It's not currently possible to do this using the current structure without<br>

> some significant and, honestly, icky patches.<br>

<br>

</div></div>What's not possible? I agree some of our local patches and design may<br>

not make it upstream as-is, but we are offloading to 2+ targets using<br>

llvm ir *today*.<br>

<br>

IMHO - you must (re)solve the problem of handling multiple targets<br>

concurrently. That means 2 targets in a single Module or 2 Modules<br>

basically glued one after the other.<br>

</blockquote></div><br></div>
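<div>To make the pointer-size concern from the top of this message concrete, here is a minimal sketch of how the same struct gets a different layout on two targets. Everything in it is an assumption for illustration: the <code>Target</code> class, the <code>struct_layout</code> helper, the "natural alignment" rule (alignment equals size) and the two example targets are hypothetical and are not LLVM APIs or real data layouts.</div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical target description; only the pointer width matters here."""
    name: str
    pointer_bytes: int  # size (and assumed alignment) of a pointer


# Sizes of the non-pointer field kinds, in bytes (assumed, target independent).
FIXED_SIZES = {"i8": 1, "i32": 4, "i64": 8}


def struct_layout(fields, target):
    """Return (field offsets, total size) assuming align == size for every field."""
    offset = 0
    max_align = 1
    offsets = []
    for kind in fields:
        size = target.pointer_bytes if kind == "ptr" else FIXED_SIZES[kind]
        align = size
        # Round the running offset up to the field's alignment.
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    # Pad the struct so its size is a multiple of its strictest alignment.
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


host = Target("host-64bit", pointer_bytes=8)    # e.g. an amd64-like host
accel = Target("accel-32bit", pointer_bytes=4)  # a hypothetical 32-bit accelerator

fields = ["i8", "ptr", "i32"]
for target in (host, accel):
    print(target.name, struct_layout(fields, target))
```

<div>Under these assumptions the struct occupies 24 bytes on the host and 12 bytes on the accelerator, with different field offsets. Any offset or size already folded into the IR of one module is therefore wrong for the other target, which is one reason to keep a separate module per target.</div>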